by pure chance, the mixed strategy profile $\pi_m$ will be an $\varepsilon$-Nash equilibrium, and then, 
since all players have a small expected regret, the process gets stuck with this value for 
a much longer time than the search period. The main technical result needed to justify 
such a statement is summarized in Lemma 7.5. This implies that if the parameters of the 
procedure are set appropriately, the length of the search period is negligible compared with 
the length of time the process spends in an $\varepsilon$-Nash equilibrium. The proof of Lemma 7.5 is 
quite technical and is beyond the scope of this book. See the bibliographic remarks for the 
appropriate pointers. In addition, the proof requires certain properties of the game that are 
not satisfied by all games. However, the necessary conditions hold for almost all games, in 
the sense that the Lebesgue measure of all those games that do not satisfy these conditions 
is 0. (Here we consider the representation of a game as the $K \prod_{k=1}^K N_k$-dimensional vector of 
all losses $\ell^{(k)}(i)$.) Let $\overline{\mathcal{N}}_\varepsilon = \Sigma \setminus \mathcal{N}_\varepsilon$ denote the complement of the set of $\varepsilon$-Nash equilibria. 


Lemma 7.5. For almost all $K$-person games there exist positive constants $c_1, c_2$ such that, 
for all sufficiently small $\rho > 0$, the $K$-step transition probabilities of experimental regret 
testing satisfy 

\[ P^{(K)}(\overline{\mathcal{N}}_\rho \to \mathcal{N}_\rho) \ge c_1 \rho^{c_2}, \]




where we use the notation $P^{(K)}(A \to B) = P[\pi_{m+K} \in B \mid \pi_m \in A]$ for the $K$-step 
transition probabilities. 


On the basis of this lemma, we can now state one of the basic properties of the experimental 
regret-testing procedure. The result states that, in the long run, the played mixed strategy 
profile fails to be an approximate Nash equilibrium only for a tiny fraction of time. 


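Theorem 7.8. Almost all games are such that there exist a positive number $\varepsilon_0$ and positive 
constants $c_1, \ldots, c_4$ such that for all $\varepsilon < \varepsilon_0$, if the experimental regret-testing procedure is 
used with parameters 

\[ \rho \in (\varepsilon, \varepsilon + \varepsilon^{c_1}), \qquad \lambda \le c_2 \varepsilon^{c_3}, \qquad T \ge -\frac{1}{2(\rho - \varepsilon)^2} \log\left(c_4 \varepsilon^{c_3}\right), \]

then for all $M \ge \log(\varepsilon/2) / \log(1 - \lambda^K)$, 

\[ P_M(\overline{\mathcal{N}}_\varepsilon) = P[\sigma_{MT} \notin \mathcal{N}_\varepsilon] \le \varepsilon. \]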


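Proof. First note that by Corollary 7.2, 

\[ P_M(\overline{\mathcal{N}}_\varepsilon) \le Q(\overline{\mathcal{N}}_\varepsilon) + (1 - \lambda^K)^M, \]

so that it suffices to bound the measure of $\overline{\mathcal{N}}_\varepsilon$ under the stationary probability $Q$. To this 
end, first observe that, by the defining property of the stationary distribution, 

\[ Q(\mathcal{N}_\rho) = Q(\mathcal{N}_\rho) P^{(K)}(\mathcal{N}_\rho \to \mathcal{N}_\rho) + Q(\overline{\mathcal{N}}_\rho) P^{(K)}(\overline{\mathcal{N}}_\rho \to \mathcal{N}_\rho). \]

Solving for $Q(\mathcal{N}_\rho)$ gives 

\[ Q(\mathcal{N}_\rho) = \frac{P^{(K)}(\overline{\mathcal{N}}_\rho \to \mathcal{N}_\rho)}{1 - P^{(K)}(\mathcal{N}_\rho \to \mathcal{N}_\rho) + P^{(K)}(\overline{\mathcal{N}}_\rho \to \mathcal{N}_\rho)}. \tag{7.3} \]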
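The identity (7.3) has the same form as the stationary distribution of a two-state Markov chain. The following minimal sketch, which is an illustration and not part of the proof, checks the formula numerically on such a chain. The names `stationary_mass`, `iterate_mass`, `p_enter`, and `p_leave` are hypothetical; `p_enter` plays the role of $P^{(K)}(\overline{\mathcal{N}}_\rho \to \mathcal{N}_\rho)$ and `p_leave` the role of $1 - P^{(K)}(\mathcal{N}_\rho \to \mathcal{N}_\rho)$.

```python
def stationary_mass(p_enter: float, p_leave: float) -> float:
    """Stationary probability of the 'good' state in a two-state chain.

    p_enter: probability of moving from the bad state to the good state.
    p_leave: probability of moving from the good state to the bad state.
    """
    return p_enter / (p_leave + p_enter)


def iterate_mass(p_enter: float, p_leave: float, steps: int = 10_000,
                 start: float = 0.0) -> float:
    """Probability of being in the good state after `steps` transitions."""
    q = start
    for _ in range(steps):
        # One step of the chain: stay in the good state or enter it.
        q = q * (1.0 - p_leave) + (1.0 - q) * p_enter
    return q


if __name__ == "__main__":
    # A chain that rarely leaves the good state spends most of its time there.
    print(stationary_mass(0.2, 0.05))
    print(iterate_mass(0.2, 0.05))
```

The closed form and the iterated chain agree, which mirrors how the proof lower-bounds $Q(\mathcal{N}_\rho)$ by making the probability of leaving small relative to the probability of entering.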


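To derive a lower bound for the expression on the right-hand side, we write the elementary 
inequality 

\[
\begin{aligned}
P^{(K)}(\mathcal{N}_\rho \to \mathcal{N}_\rho)
&= \frac{Q(\mathcal{N}_\varepsilon) P^{(K)}(\mathcal{N}_\varepsilon \to \mathcal{N}_\rho)}{Q(\mathcal{N}_\rho)}
 + \frac{Q(\mathcal{N}_\rho \setminus \mathcal{N}_\varepsilon) P^{(K)}(\mathcal{N}_\rho \setminus \mathcal{N}_\varepsilon \to \mathcal{N}_\rho)}{Q(\mathcal{N}_\rho)} \\
&\ge \frac{Q(\mathcal{N}_\varepsilon) P^{(K)}(\mathcal{N}_\varepsilon \to \mathcal{N}_\rho)}{Q(\mathcal{N}_\rho)}.
\end{aligned}
\tag{7.4}
\]

To bound $P^{(K)}(\mathcal{N}_\varepsilon \to \mathcal{N}_\rho)$, note that if $\pi_m \in \mathcal{N}_\varepsilon$, then the expected regret of all players 
is at most $\varepsilon$. Since the regret estimates $r^{(k)}_{m,i_k}$ are sums of $T$ independent random variables 
taking values between 0 and 1 with mean at most $\varepsilon$, Hoeffding's inequality implies that 

\[ P\left[ r^{(k)}_{m,i_k} \ge \rho \right] \le e^{-2T(\rho - \varepsilon)^2}, \qquad i_k = 1, \ldots, N_k, \quad k = 1, \ldots, K. \]

Then the probability that there is at least one player $k$ and a strategy $i_k \le N_k$ such that 
$r^{(k)}_{m,i_k} \ge \rho$ is bounded by $\sum_{k=1}^K N_k e^{-2T(\rho-\varepsilon)^2} = N e^{-2T(\rho-\varepsilon)^2}$. Thus, with probability at least 
$(1-\lambda)^K \left(1 - N e^{-2T(\rho-\varepsilon)^2}\right)$, all players keep playing the same mixed strategy, and therefore 

\[ P(\mathcal{N}_\varepsilon \to \mathcal{N}_\varepsilon) \ge (1-\lambda)^K \left(1 - N e^{-2T(\rho-\varepsilon)^2}\right). \]

Consequently, since $\rho > \varepsilon$, we have $P^{(K)}(\mathcal{N}_\varepsilon \to \mathcal{N}_\rho) \ge P^{(K)}(\mathcal{N}_\varepsilon \to \mathcal{N}_\varepsilon)$ and hence 

\[
\begin{aligned}
P^{(K)}(\mathcal{N}_\varepsilon \to \mathcal{N}_\rho)
&\ge P^{(K)}(\mathcal{N}_\varepsilon \to \mathcal{N}_\varepsilon) \ge P(\mathcal{N}_\varepsilon \to \mathcal{N}_\varepsilon)^K \\
&\ge (1-\lambda)^{K^2} \left(1 - N e^{-2T(\rho-\varepsilon)^2}\right)^K \ge 1 - K^2 \lambda - N K e^{-2T(\rho-\varepsilon)^2}
\end{aligned}
\]

(where we assumed that $\lambda \le 1$ and $N e^{-2T(\rho-\varepsilon)^2} \le 1$). Thus, using (7.4) and the obtained 
estimate, we have 

\[ P^{(K)}(\mathcal{N}_\rho \to \mathcal{N}_\rho) \ge \frac{Q(\mathcal{N}_\varepsilon)}{Q(\mathcal{N}_\rho)} \left(1 - K^2 \lambda - N K e^{-2T(\rho-\varepsilon)^2}\right). \]

Next we need to show that, for proper choice of the parameters, $P^{(K)}(\overline{\mathcal{N}}_\rho \to \mathcal{N}_\rho)$ is 
sufficiently large. For almost all $K$-person games, this follows from Lemma 7.5, which 
asserts that 

\[ P^{(K)}(\overline{\mathcal{N}}_\rho \to \mathcal{N}_\rho) \ge C_1 \rho^{C_2} \]

for some positive constants $C_1$ and $C_2$ that depend on the game. Hence, from (7.3) we 
obtain 

\[ Q(\mathcal{N}_\rho) \ge \frac{C_1 \rho^{C_2}}{1 - \left(1 - K^2 \lambda - N K e^{-2T(\rho-\varepsilon)^2}\right) \dfrac{Q(\mathcal{N}_\varepsilon)}{Q(\mathcal{N}_\rho)} + C_1 \rho^{C_2}}. \]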

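It remains to estimate the ratio $Q(\mathcal{N}_\varepsilon)/Q(\mathcal{N}_\rho)$. We need to show that the ratio is close 
to 1 whenever $\rho - \varepsilon \ll \varepsilon$. It turns out that one can show that, in fact, for almost every game 
there exist positive constants $C_3$, $C_4$, and $C_5$, depending on the game, such that 

\[ \frac{Q(\mathcal{N}_\varepsilon)}{Q(\mathcal{N}_\rho)} \ge 1 - \frac{C_3 (\rho - \varepsilon)^{C_4}}{\rho^{C_5}}. \]

This inequality is not 
surprising, but the rigorous proof of this statement is somewhat technical and is skipped 
here. It may be found in [126]. Summarizing, 

\[
\begin{aligned}
Q(\mathcal{N}_\varepsilon)
&\ge Q(\mathcal{N}_\rho) \left(1 - \frac{C_3 (\rho - \varepsilon)^{C_4}}{\rho^{C_5}}\right) \\
&\ge \left(1 - \frac{C_3 (\rho - \varepsilon)^{C_4}}{\rho^{C_5}}\right)
 \times \frac{C_1 \rho^{C_2}}{1 - \left(1 - K^2 \lambda - N K e^{-2T(\rho-\varepsilon)^2}\right) \left(1 - \dfrac{C_3 (\rho - \varepsilon)^{C_4}}{\rho^{C_5}}\right) + C_1 \rho^{C_2}}
\end{aligned}
\]

for some positive constants $C_1, \ldots, C_5$. Choosing the parameters $\rho$, $\lambda$, $T$ with appropriate 
constants $c_1, \ldots, c_4$, we have 

\[ Q(\overline{\mathcal{N}}_\varepsilon) \le \varepsilon/2. \]

If $M$ is so large that $(1 - \lambda^K)^M \le \varepsilon/2$, we have $P_M(\overline{\mathcal{N}}_\varepsilon) \le \varepsilon$, as desired. ∎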
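
The condition on $M$ in Theorem 7.8 is easy to evaluate numerically. The sketch below computes the smallest integer $M$ with $(1 - \lambda^K)^M \le \varepsilon/2$; the function name `min_periods` and its argument names are hypothetical, and the numerical values in the usage example are arbitrary illustrations rather than parameters prescribed by the theorem.

```python
import math


def min_periods(eps: float, lam: float, num_players: int) -> int:
    """Smallest integer M such that (1 - lam**num_players)**M <= eps / 2."""
    if not (0.0 < eps < 2.0):
        raise ValueError("eps must lie in (0, 2)")
    if not (0.0 < lam < 1.0):
        raise ValueError("lam must lie in (0, 1)")
    if num_players < 1:
        raise ValueError("num_players must be at least 1")
    # Both logarithms are negative, so the ratio is positive.
    ratio = math.log(eps / 2.0) / math.log(1.0 - lam ** num_players)
    return math.ceil(ratio)


if __name__ == "__main__":
    # Two players, exploration probability 0.1, accuracy 0.1.
    print(min_periods(eps=0.1, lam=0.1, num_players=2))
```

Because $\lambda^K$ shrinks geometrically in the number of players, the required number of periods grows quickly with $K$ for a fixed $\lambda$.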

Theorem 7.8 states that if the parameters of the experimental regret-testing procedure are 
set in an appropriate way, the mixed strategy profiles will be in an approximate equilibrium 
most of the time. However, it is important to realize that the theorem does not claim 
convergence in any way. In fact, if the parameters $T$, $\rho$, and $\lambda$ are kept fixed forever, the 
process will periodically abandon the set of $\varepsilon$-Nash equilibria and wander around for a 
long time before it gets stuck in a (possibly different) $\varepsilon$-Nash equilibrium. Then the process 
stays there for an even longer time before leaving again. However, since the process 
$\{\pi_m\}$ forms an ergodic Markov chain, it is easy to deduce convergence of the empirical 
frequencies of play. Specifically, we show next that if all players play according to the 
experimental regret-testing procedure, then the joint empirical frequencies of play converge 
almost surely to a joint distribution $\bar P$ that is in the convex hull of $\varepsilon$-Nash equilibria. The 
precise statement is given in Theorem 7.9. 

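Recall that for each $t = 1, 2, \ldots$ we denote by $I_t^{(k)}$ the pure strategy played by the 
$k$th player. $I_t^{(k)}$ is drawn randomly according to the mixed strategy $\pi_m^{(k)}$ whenever $t \in 
\{mT + 1, \ldots, (m + 1)T\}$. Consider the joint empirical distribution of plays $\hat P_t$ defined by 

\[ \hat P_t(i) = \frac{1}{t} \sum_{s=1}^{t} \mathbb{1}_{\{I_s = i\}}, \qquad i \in \prod_{k=1}^{K} \{1, \ldots, N_k\}. \]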

Denote the convex hull of a set A by co(A). 


Remark 7.13. Recall that Nash equilibria and $\varepsilon$-Nash equilibria $\pi$ are mixed strategy 
profiles, that is, product distributions, and have been considered, up to this point, as elements 
of the set $\Sigma$ of product distributions. However, a product distribution is a special joint 
distribution over the set $\prod_{k=1}^K \{1, \ldots, N_k\}$ of pure strategy profiles, and it is this “larger” 
space in which the convex hull of $\varepsilon$-Nash equilibria is defined. Thus, elements of the convex 
hull are typically not product distributions. (Recall that the convex hull of Nash equilibria 
is a subset of the set of correlated equilibria.) 


Theorem 7.9. For almost every game and for every sufficiently small $\varepsilon > 0$, there exists a 
choice of the parameters $(T, \rho, \lambda)$ such that the following holds: there is a joint distribution 
$\bar P$ over the set of $K$-tuples $i = (i_1, \ldots, i_K)$ of actions in the convex hull $\mathrm{co}(\mathcal{N}_\varepsilon)$ of the set of 
$\varepsilon$-Nash equilibria such that the joint empirical frequencies of play of experimental regret 
testing satisfy 

\[ \lim_{t \to \infty} \hat P_t = \bar P \quad \text{almost surely.} \]


Proof. If $\pi = (\pi^{(1)} \times \cdots \times \pi^{(K)}) \in \Sigma$ is a product distribution, introduce the notation 

\[ P(\pi, i) = \prod_{k=1}^{K} \pi^{(k)}(i_k), \]

where $i = (i_1, \ldots, i_K)$. In other words, $P(\pi, \cdot)$ is a joint distribution over the set of action 
profiles $i$, induced by $\pi$. 

First observe that, since at time $t$ the vector $I_t$ of actions is chosen according to the 
mixed strategy profile $\pi_{\lfloor t/T \rfloor}$, by martingale convergence, for every $i$, 

\[ \hat P_t(i) - \frac{1}{t} \sum_{s=1}^{t} P(\pi_{\lfloor s/T \rfloor}, i) \to 0 \quad \text{almost surely.} \]

Therefore, it suffices to prove convergence of $\frac{1}{t} \sum_{s=1}^{t} P(\pi_{\lfloor s/T \rfloor}, i)$. Since $\pi_{\lfloor s/T \rfloor}$ is 
unchanged during periods of length $T$, we obviously have 

\[ \lim_{t \to \infty} \frac{1}{t} \sum_{s=1}^{t} P(\pi_{\lfloor s/T \rfloor}, i) = \lim_{M \to \infty} \frac{1}{M} \sum_{m=1}^{M} P(\pi_m, i). \]

By Corollary 7.2, 

\[ \lim_{M \to \infty} \frac{1}{M} \sum_{m=1}^{M} \pi_m = \bar\pi \quad \text{almost surely,} \]

where $\bar\pi = \int_\Sigma \pi \, dQ(\pi)$. (Recall that $Q$ is the unique stationary distribution of the Markov 
process.) This, in turn, implies by continuity of $P(\pi, i)$ in $\pi$ that there exists a joint 
distribution $\bar P(i) = \int_\Sigma P(\pi, i) \, dQ(\pi)$ such that, for all $i$, 

\[ \lim_{M \to \infty} \frac{1}{M} \sum_{m=1}^{M} P(\pi_m, i) = \bar P(i) \quad \text{almost surely.} \]


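It remains to show that $\bar P \in \mathrm{co}(\mathcal{N}_\varepsilon)$. 
Let $\varepsilon' < \varepsilon$ be a positive number such that the $\varepsilon'$ blowup of $\mathrm{co}(\mathcal{N}_{\varepsilon'})$ is contained in $\mathrm{co}(\mathcal{N}_\varepsilon)$, 
that is, 

\[ \{ P \in \Sigma : \exists P' \in \mathrm{co}(\mathcal{N}_{\varepsilon'}) \text{ such that } \|P - P'\|_1 < \varepsilon' \} \subset \mathrm{co}(\mathcal{N}_\varepsilon). \]

Such an $\varepsilon'$ always exists for almost all games by Exercise 7.28. In fact, one may choose 
$\varepsilon' = \varepsilon/c_3$ for a sufficiently large positive constant $c_3$ (whose value depends on the game). 
Now choose the parameters $(T, \rho, \lambda)$ such that $Q(\overline{\mathcal{N}}_{\varepsilon'}) < \varepsilon'$. Theorem 7.8 guarantees 
the existence of such a choice. 
Clearly, 

\[ \bar P(i) = \int_\Sigma P(\pi, i) \, dQ(\pi) = \int_{\mathcal{N}_{\varepsilon'}} P(\pi, i) \, dQ(\pi) + \int_{\overline{\mathcal{N}}_{\varepsilon'}} P(\pi, i) \, dQ(\pi). \]

Since $\int_{\mathcal{N}_{\varepsilon'}} P(\pi, \cdot) \, dQ(\pi) \in \mathrm{co}(\mathcal{N}_{\varepsilon'})$, we find that the $L_1$ distance of $\bar P$ and $\mathrm{co}(\mathcal{N}_{\varepsilon'})$ satisfies 

\[ d_1\left(\bar P, \mathrm{co}(\mathcal{N}_{\varepsilon'})\right) \le \left\| \int_{\overline{\mathcal{N}}_{\varepsilon'}} P(\pi, \cdot) \, dQ(\pi) \right\|_1 \le \int_{\overline{\mathcal{N}}_{\varepsilon'}} dQ(\pi) = Q(\overline{\mathcal{N}}_{\varepsilon'}) < \varepsilon'. \]

By the choice of $\varepsilon'$ we indeed have $\bar P \in \mathrm{co}(\mathcal{N}_\varepsilon)$. ∎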


Remark 7.14 (Convergence of the mixed strategy profiles). We only mention briefly that 
the experimental regret testing procedure can be extended to obtain an uncoupled strategy 
such that the mixed strategy profiles converge, with probability 1, to the set of Nash 
equilibria for almost all games. Note that we claim convergence not only of the empirical 
frequencies of plays but also of the actual mixed strategy profiles. Moreover, we claim 
convergence to $\mathcal{N}$ and not to the convex hull $\mathrm{co}(\mathcal{N}_\varepsilon)$ of all $\varepsilon$-Nash equilibria for a fixed $\varepsilon$. The 
basic idea is to “anneal” experimental regret testing such that first it is used with some 
parameters $(T_1, \rho_1, \lambda_1)$ for a number $M_1$ of periods of length $T_1$ and then the parameters are 
changed to $(T_2, \rho_2, \lambda_2)$ (by increasing $T$ and decreasing $\rho$ and $\lambda$ properly) and experimental 
regret testing is used for a number $M_2 \gg M_1$ of periods (of length $T_2$), and so on. However, 
this is not sufficient to guarantee almost sure convergence, because at each change of 
parameters the process is reinitialized and therefore there is an infinite set of indices $t$ such 
that $\sigma_t$ is far away from any Nash equilibrium. A possible solution is based on “localizing” 
the search after each change of parameters such that each player limits its choice to a small 
neighborhood of the mixed strategy played right before the change of parameters (unless a 
player experiences a large regret in which case the search is extended again to the whole 
simplex). Another challenge one must face in designing a genuinely uncoupled procedure 
is that the values of the parameters of the procedure (i.e., $T_\ell$, $\rho_\ell$, $\lambda_\ell$, and $M_\ell$, $\ell = 1, 2, \ldots$) 
cannot depend on the parameters of the game, because by requiring uncoupledness we must 
assume that the players only know their payoff function but not those of the other players. 
We leave the details as an exercise. 


Remark 7.15 (Nongeneric games). All results of this section up to this point hold for 
almost every game. The reason for this restriction is that our proofs require an assumption 
of genericity of the game. We do not know whether Theorems 7.8 and 7.9 extend to all 
games. However, by a simple trick one can modify experimental regret testing such that 
the results of these two theorems hold for all games. The idea is that before starting to 
play, each player slightly perturbs the values of his loss function and then plays as if his 
losses were the perturbed values. For example, define, for each player $k$ and pure strategy 
profile $i$, 

\[ \tilde\ell^{(k)}(i) = \ell^{(k)}(i) + Z_{i,k}, \]

where the $Z_{i,k}$ are i.i.d. random variables uniformly distributed in the interval $[-\varepsilon, \varepsilon]$. 
Clearly, the perturbed game is generic with probability 1. Therefore, if all players play 
according to experimental regret testing but on the basis of the perturbed losses, then 
Theorems 7.8 and 7.9 are valid for this newly generated game. However, because for all $k$ 
and $i$ we have $|\tilde\ell^{(k)}(i) - \ell^{(k)}(i)| < \varepsilon$, every $\varepsilon$-Nash equilibrium of the perturbed game is a 
$2\varepsilon$-Nash equilibrium of the original game. 
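
The perturbation trick of Remark 7.15 amounts to a single pass over a player's loss table before play starts. The sketch below is one possible way to write it; the function name `perturb_losses`, the dictionary representation of the loss function, and the optional `rng` argument are assumptions of this example.

```python
import random


def perturb_losses(losses, eps, rng=None):
    """Return a perturbed copy of one player's loss table.

    losses: dict mapping a pure strategy profile (a tuple) to that player's loss.
    eps:    half-width of the uniform perturbation interval [-eps, eps].
    rng:    optional random.Random instance, useful for reproducibility.
    """
    if eps < 0:
        raise ValueError("eps must be nonnegative")
    rng = rng or random.Random()
    # Each entry receives its own independent uniform perturbation.
    return {profile: value + rng.uniform(-eps, eps)
            for profile, value in losses.items()}


if __name__ == "__main__":
    table = {(1, 1): 0.2, (1, 2): 0.9, (2, 1): 0.0, (2, 2): 1.0}
    print(perturb_losses(table, eps=0.01, rng=random.Random(0)))
```

Each player applies the perturbation to his own losses only, so the procedure stays uncoupled.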


Finally, we show how experimental regret testing can be modified so that it can be played 
in the model of unknown games with similar performance guarantees. In order to adjust 
the procedure, recall that the only place in which the players look at the past is when they 
calculate the regrets 


\[ r^{(k)}_{m,i_k} = \frac{1}{T} \sum_{s=(m-1)T+1}^{mT} \ell^{(k)}(I_s) - \frac{1}{T} \sum_{s=(m-1)T+1}^{mT} \ell^{(k)}(I_s^-, i_k). \]

However, each player may estimate his regret in a simple way. Observe that the first term in 
the definition of $r^{(k)}_{m,i_k}$ is just the average loss of player $k$ over the $m$th period, which is available 
to the player, and does not need to be estimated. However, the second term is the average 
loss the player would have suffered had he chosen to play action $i_k$ all the time during this period. 

This can be estimated by random sampling. The idea is that, at each time instant, player 
$k$ flips a biased coin and, if the outcome is heads (the probability of which is very small), 
then instead of choosing an action according to the mixed strategy $\pi^{(k)}_{m-1}$ the player chooses 
one uniformly at random. At these time instants, the player collects sufficient information 
to estimate the regret with respect to each fixed action $i_k$. 

To formalize this idea, consider a period between times $(m - 1)T + 1$ and $mT$. During 
this period, player $k$ draws $n_k$ samples for each of the actions $i_k = 1, \ldots, N_k$, where $n_k \ll T$ 
is to be determined later. Formally, define the random variables $U_{k,s} \in \{0, 1, \ldots, N_k\}$, 
where, for $s$ between $(m - 1)T + 1$ and $mT$, for each $i_k = 1, \ldots, N_k$, there are exactly 
$n_k$ values of $s$ such that $U_{k,s} = i_k$, and all such configurations are equally probable; for 
the remaining $s$, $U_{k,s} = 0$. (In other words, for each $i_k = 1, \ldots, N_k$, $n_k$ values of $s$ are 
chosen randomly, without replacement, such that these values are disjoint for different 
$i_k$'s.) Then, at time $s$, player $k$ draws an action $I_s^{(k)}$ as follows: conditionally on the past up 
to time $s - 1$, 

\[ I_s^{(k)} \ \begin{cases} \text{is distributed as } \pi^{(k)}_{m-1} & \text{if } U_{k,s} = 0, \\ \text{equals } i_k & \text{if } U_{k,s} = i_k. \end{cases} \]


The regret $r^{(k)}_{m,i_k}$ may be estimated by 

\[ \hat r^{(k)}_{m,i_k} = \frac{1}{T - N_k n_k} \sum_{s=(m-1)T+1}^{mT} \ell^{(k)}(I_s) \mathbb{1}_{\{U_{k,s} = 0\}} - \frac{1}{n_k} \sum_{s=(m-1)T+1}^{mT} \ell^{(k)}(I_s^-, i_k) \mathbb{1}_{\{U_{k,s} = i_k\}}, \]

$i_k = 1, \ldots, N_k$. The first term of the definition of $\hat r^{(k)}_{m,i_k}$ is just the average of the losses 
of player $k$ over those periods in which the player does not “experiment,” that is, when 
$U_{k,s} = 0$. (Note that there are exactly $T - N_k n_k$ such periods.) Since $N_k n_k \ll T$, this 
average should be close to the first term in the definition of the average regret $r^{(k)}_{m,i_k}$. The 
second term is the average over those time periods in which player $k$ experiments and 
plays action $i_k$ (i.e., when $U_{k,s} = i_k$). This may be considered as an estimate, obtained 
by sampling without replacement, of the second term in the definition of $r^{(k)}_{m,i_k}$. Observe 
that $\hat r^{(k)}_{m,i_k}$ only depends on the past payoffs experienced by player $k$, and therefore these 
estimates are feasible in the unknown game model. 
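
The sampling scheme and the regret estimate can be sketched in a few lines. In the code below, the function names `exploration_schedule` and `estimated_regrets`, the 0-based time index within a period, and the list `own_losses` of losses actually suffered by the player are assumptions of this illustration. The key observation it relies on is the one made in the text: at a step where the player experiments with action $i_k$, the loss he suffers is exactly the loss of playing $i_k$ against the other players' actions.

```python
import random


def exploration_schedule(period_length, num_actions, n_samples, rng):
    """Draw U[s] for one period: 0 means 'no experiment', i >= 1 means 'play i'.

    Exactly n_samples steps are assigned to each action, the assigned steps
    are disjoint, and all such configurations are equally likely.
    """
    needed = num_actions * n_samples
    if needed >= period_length:
        raise ValueError("need num_actions * n_samples < period_length")
    slots = rng.sample(range(period_length), needed)
    schedule = [0] * period_length
    for j, s in enumerate(slots):
        schedule[s] = j // n_samples + 1
    return schedule


def estimated_regrets(own_losses, schedule, num_actions, n_samples):
    """Estimated regret of each action, using only the player's own losses."""
    steps = range(len(schedule))
    regular = [own_losses[s] for s in steps if schedule[s] == 0]
    first_term = sum(regular) / len(regular)
    estimates = []
    for action in range(1, num_actions + 1):
        sampled = [own_losses[s] for s in steps if schedule[s] == action]
        estimates.append(first_term - sum(sampled) / n_samples)
    return estimates


if __name__ == "__main__":
    rng = random.Random(0)
    U = exploration_schedule(20, num_actions=3, n_samples=2, rng=rng)
    losses = [1.0 if u == 0 else 0.25 for u in U]
    print(U)
    print(estimated_regrets(losses, U, num_actions=3, n_samples=2))
```

In the usage example every regular step costs 1 and every experimental step costs 0.25, so each estimated regret equals 0.75.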

In order to show that the estimated regrets work in this case, we only need to establish 
that the probability that the estimated regret exceeds $\rho$ is small if the expected regret is not 
more than $\varepsilon$ (whenever $\varepsilon < \rho$). This is done in the following lemma. It guarantees that if 
the experimental regret-testing procedure is run using the regret estimates described above, 
then results analogous to Theorems 7.8 and 7.9 may be obtained, in a straightforward way, 
in the unknown-game model. 


Lemma 7.6. Assume that in a certain period of length $T$, the expected regret 
$E\left[ r^{(k)}_{m,i_k} \mid I_1, \ldots, I_{(m-1)T} \right]$ of player $k$ is at most $\varepsilon$. Then, for a sufficiently small $\varepsilon$, with the 
choice of parameters of Theorem 7.8, 

\[ P\left[ \hat r^{(k)}_{m,i_k} \ge \rho \right] \le c\, T^{-1/3} + \exp\left( -T^{1/3} (\rho - \varepsilon)^2 \right). \]


Proof. We show that, with large probability, $\hat r^{(k)}_{m,i_k}$ is close to $r^{(k)}_{m,i_k}$. To this end, first we 
compare the first terms in the expression of both. Observe that at those periods $s$ of time 
when none of the players experiments (i.e., when $U_{k,s} = 0$ for all $k = 1, \ldots, K$), the 
corresponding terms of both estimates are equal. Thus, by simple algebra it is easy to see 
that the first terms differ by at most $\frac{2}{T} \sum_{k=1}^K N_k n_k$. 

It remains to compare the second terms in the expressions of $\hat r^{(k)}_{m,i_k}$ and $r^{(k)}_{m,i_k}$. Observe 
that if there is no time instant $s$ for which $U_{k,s} = 1$ and $U_{k',s} = 1$ for some $k' \ne k$, then 

\[ \frac{1}{n_k} \sum_{s=t+1}^{t+T} \ell^{(k)}(I_s^-, i_k) \mathbb{1}_{\{U_{k,s} = i_k\}} \]

is an unbiased estimate of 

\[ \frac{1}{T} \sum_{s=t+1}^{t+T} \ell^{(k)}(I_s^-, i_k) \]

obtained by random sampling. The probability that two players sample at the same time 
is at most 

\[ T K^2 \max_{k, k' \le K} \frac{N_k n_k}{T} \cdot \frac{N_{k'} n_{k'}}{T}, \]

where we used the union-of-events bound over all pairs of players and all $T$ time instants. By 
Hoeffding's inequality for an average of a sample taken without replacement (see Lemma 
A.2), we have 

\[ P_U\left[ \frac{1}{n_k} \sum_{s=t+1}^{t+T} \ell^{(k)}(I_s^-, i_k) \mathbb{1}_{\{U_{k,s} = i_k\}} - \frac{1}{T} \sum_{s=t+1}^{t+T} \ell^{(k)}(I_s^-, i_k) > \alpha \right] \le e^{-2 n_k \alpha^2}, \]

where $P_U$ denotes the distribution induced by the random variables $U_{k,s}$. Putting everything 
together, 

\[ P\left[ \hat r^{(k)}_{m,i_k} \ge \rho \right] \le T K^2 \max_{k, k' \le K} \frac{N_k n_k N_{k'} n_{k'}}{T^2} + \exp\left( -2 n_k \left( \rho - \varepsilon - \frac{2}{T} \sum_{k=1}^K N_k n_k \right)^2 \right). \]

Choosing $n_k \sim T^{1/3}$, the first term on the right-hand side is of order $T^{-1/3}$ and 
$\frac{1}{T} \sum_{k=1}^K N_k n_k = O(T^{-2/3})$ becomes negligible compared with $\rho - \varepsilon$. ∎

Regret-minimizing strategies, such as those discussed in Sections 4.2 and 4.3, set up the 
goal of predicting as well as the best constant strategy in hindsight, assuming that the 
actions of the opponents would have been the same had the forecaster been following that 
constant strategy. However, when a forecasting strategy is used to play a repeated game, 
the actions prescribed by the forecasting strategy may have an effect on the behavior of the 
opponents, and so measuring regret as the difference of the suffered cumulative loss and 
that of the best constant action in hindsight may be very misleading. 

To simplify the setup of the problem, we consider playing a two-player game such that, 
at time $t$, the row player takes an action $I_t \in \{1, \ldots, N\}$ and the column player takes action 
$J_t \in \{1, \ldots, M\}$. The loss suffered by the row player at time $t$ is $\ell(I_t, J_t)$. (The loss of the 
column player is immaterial in this section. Note also that since we are only concerned with 
the loss of the first player, there is no loss of generality in assuming that there are only two 
players, since otherwise $J_t$ can represent the joint play of all other players.) In the language 
of Chapter 4, we consider the case of a nonoblivious opponent; that is, the actions of the 
column player (the opponent) may depend on the history $I_1, \ldots, I_{t-1}$ of past moves of the 
row player. 

To illustrate why regret-minimizing strategies may fail miserably in such a scenario, 
consider the repeated play of a prisoners’ dilemma, that is, a 2 x 2 game in which the loss 
matrix of the row player is given by 


 R \ C |  c  |  d
-------+-----+-----
   c   | 1/3 |  1
   d   |  0  | 2/3


In the usual definition of the prisoners’ dilemma, the column player has the same loss 
matrix as the row player. In this game both players can either cooperate (“c”) or defect 
(“d”). Regardless of what the column player does, the row player is better off defecting 
(and the same goes for the column player). However, it is better for the players if they both 
cooperate than if they both defect. 

Now assume that the game is played repeatedly and the row player plays according 
to a Hannan-consistent strategy; that is, the normalized cumulative loss $\frac{1}{n} \sum_{t=1}^n \ell(I_t, J_t)$ 
approaches $\min_{i=c,d} \frac{1}{n} \sum_{t=1}^n \ell(i, J_t)$. Clearly, the minimum is achieved by action “d” and 
therefore the row player will defect basically all the time. In a certain worst-case sense 
this may be the best one can hope for. However, in many realistic situations, depending 
on the behavior of the adversary, significantly smaller losses can be achieved. For example, 
the column player may be willing to try cooperation. Perhaps the simplest such strategy 
of the opponent is “tit for tat,” in which the opponent repeats the row player’s previous 
action. In such a case, by playing a Hannan-consistent strategy, the row player’s performance 
is much worse than what he could have achieved by following the expert “c” (which is the 
worse action in the sense of the notions of regret we have used so far). 
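
The gap between always cooperating and always defecting against tit for tat can be checked with exact arithmetic. The sketch below assumes the loss values 1/3, 1, 0, and 2/3 for the action pairs (c, c), (c, d), (d, c), and (d, d), and it assumes that tit for tat opens by cooperating; the function name `average_loss_vs_tit_for_tat` is hypothetical.

```python
from fractions import Fraction

# Row player's loss, indexed by (row action, column action).
# These values are an assumption of this example.
LOSS = {
    ("c", "c"): Fraction(1, 3),
    ("c", "d"): Fraction(1),
    ("d", "c"): Fraction(0),
    ("d", "d"): Fraction(2, 3),
}


def average_loss_vs_tit_for_tat(row_actions, opponent_first="c"):
    """Average loss of the row player when the column player plays tit for tat."""
    column = opponent_first
    total = Fraction(0)
    for row in row_actions:
        total += LOSS[(row, column)]
        column = row  # tit for tat repeats the row player's previous action
    return total / len(row_actions)


if __name__ == "__main__":
    n = 1000
    print(average_loss_vs_tit_for_tat(["c"] * n))  # always cooperate
    print(average_loss_vs_tit_for_tat(["d"] * n))  # always defect
```

Always cooperating yields an average loss of 1/3, while always defecting yields an average loss close to 2/3, even though defecting is the better action against every fixed column sequence.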

The purpose of this section is to introduce forecasting strategies that avoid falling in traps 
similar to the one described above under certain assumptions on the opponent’s behavior. 

To this end, consider the scenario where, rather than requiring Hannan consistency, 
the goal of the forecaster is to achieve a cumulative loss (almost) as small as that of the 
best action, where the cumulative loss of each action is calculated by looking at what 
would have happened if that action had been followed throughout the whole repeated 
game. 

It is obvious that a completely malicious adversary can make it impossible to estimate 
what would have happened if a certain action had been played all the time (unless that 
action is played all the time). But under certain natural assumptions on the behavior of 
the adversary, such an inference is possible. The assumptions under which our goal can 
be reached require a kind of “stationarity” and bounded memory of the opponent and are 
certainly satisfied for simple strategies such as tit for tat. 

Remark 7.16 (Hannan-consistent strategies are sometimes better). We have argued that in 
some cases it makes more sense to look for strategies that perform as well as the best action 
if that action had been played all the time rather than playing Hannan-consistent strategies. 
The repeated prisoners’ dilemma with the adversary playing tit for tat is a clear example. 
However, in some other cases Hannan-consistent strategies may perform much better than 
the best action in this new sense. The following example describes such a situation: assume 
that the row player has $N = 2$ actions, and let $n$ be even such that $n/2$ is odd. Assume 
that in the first $n/2$ time periods the losses of both actions are 1 in each period. After $n/2$ 
periods the adversary decides to assign losses $00 \ldots 000$ ($n/2$ times) to the action that was 
played fewer times during the first $n/2$ rounds and $11 \ldots 111$ to the other action. Clearly, a 
Hannan-consistent strategy has a cumulative loss of about $n/2$ during the $n$ periods of the 
game. On the other hand, if any of the two actions is played constantly, its cumulative 
loss is $n$. 


The goal of this section is to design strategies that guarantee, under certain assumptions on 
the behavior of the column player, that the average loss $\frac{1}{n} \sum_{t=1}^n \ell(I_t, J_t)$ is not much larger 
than $\min_{i=1,\ldots,N} \mu_{i,n}$, where $\mu_{i,n}$ is the average loss of a hypothetical player who plays the 
same action $I_t = i$ in each round of the game. 

A key ingredient of the argument is a different way of measuring regret. The goal of 
the forecaster in this new setup is to achieve, during the $n$ periods of play, an average loss 
almost as small as the average loss of the best action, where the average is computed over 
only those periods in which the action was chosen by the forecaster. To make the definition 
formal, denote by 

\[ \hat\mu_t = \frac{1}{t} \sum_{s=1}^{t} \ell(I_s, J_s) \]

the averaged cumulative loss of the forecaster at time $t$ and by 

\[ \hat\mu_{i,t} = \frac{\sum_{s=1}^{t} \ell(I_s, J_s) \mathbb{1}_{\{I_s = i\}}}{\sum_{s=1}^{t} \mathbb{1}_{\{I_s = i\}}} \]

the averaged cumulative loss of action $i$, averaged over the time periods in which the action 
was played by the forecaster. If $\sum_{s=1}^{t} \mathbb{1}_{\{I_s = i\}} = 0$, let $\hat\mu_{i,t}$ take the maximal value 1. At this 
point it may not be entirely clear how the averaged losses $\hat\mu_{i,t}$ are related to the average 
loss of a player who plays the same action $i$ all the time. However, shortly it will become 
clear that these quantities can be related under some assumptions on the behavior of the 
opponent and certain restrictions on the forecasting strategy. 

The property that the forecaster needs to satisfy for our purposes is that, asymptotically, 
the average loss $\hat\mu_n$ is not larger than the smallest asymptotic average loss $\hat\mu_{i,n}$. More 
precisely, we need to construct a forecaster that achieves 

\[ \limsup_{n \to \infty} \hat\mu_n \le \min_{i=1,\ldots,N} \limsup_{n \to \infty} \hat\mu_{i,n}. \]

Surprisingly, there exists a deterministic forecaster that satisfies this asymptotic inequality 
regardless of the opponent’s behavior. Here we describe such a strategy for the case of 
$N = 2$ actions. The simple extension to the general case of more than two actions is left as 
an exercise (Exercise 7.30). Consider the following simple deterministic forecaster. 




DETERMINISTIC EXPLORATION-EXPLOITATION 
For each round $t = 1, 2, \ldots$ 


(1) (Exploration) if $t = k^2$ for an integer $k$, then set $I_t = 1$; 
if $t = k^2 + 1$ for an integer $k$, then set $I_t = 2$; 

(2) (Exploitation) otherwise, let $I_t = \operatorname{argmin}_{i=1,2} \hat\mu_{i,t-1}$ (in case of a tie, break it, 
say, in favor of action 1). 


This simple forecaster is a version of fictitious play, based on the averaged losses, 
in which the exploration step simply guarantees that every action is sampled infinitely 
often. There is nothing special about the time instances of the form $t = k^2, k^2 + 1$; any 
sparse infinite sequence would do the job. In fact, the original algorithm of de Farias and 
Megiddo [85] chooses the exploration steps randomly. 

Observe, in passing, that this is a “bandit’’-type predictor in the sense that it only needs 
to observe the losses of the played actions. 
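
The deterministic exploration-exploitation forecaster translates almost line by line into code. The sketch below is one possible reading: the function names `is_square` and `explore_exploit` and the callback `loss_of(t, action)` are hypothetical, and the convention that only integers $k \ge 1$ trigger exploration is an assumption made to resolve the overlap at $t = 1$.

```python
import math


def is_square(m: int) -> bool:
    """True if m = k * k for some integer k >= 1."""
    if m < 1:
        return False
    r = math.isqrt(m)
    return r * r == m


def explore_exploit(loss_of, n):
    """Run the forecaster for n rounds with the two actions 1 and 2.

    loss_of(t, action) returns the loss in [0, 1] suffered at round t.
    Only the loss of the action actually played is ever requested.
    Returns the list of chosen actions and the list of suffered losses.
    """
    total = {1: 0.0, 2: 0.0}
    count = {1: 0, 2: 0}
    actions, losses = [], []
    for t in range(1, n + 1):
        if is_square(t):            # exploration of action 1
            action = 1
        elif is_square(t - 1):      # exploration of action 2
            action = 2
        else:                       # exploitation; ties go to action 1
            avg = {i: (total[i] / count[i] if count[i] else 1.0) for i in (1, 2)}
            action = 1 if avg[1] <= avg[2] else 2
        loss = loss_of(t, action)
        total[action] += loss
        count[action] += 1
        actions.append(action)
        losses.append(loss)
    return actions, losses


if __name__ == "__main__":
    acts, ls = explore_exploit(lambda t, a: 0.2 if a == 1 else 0.8, 10)
    print(acts)
```

With constant losses 0.2 and 0.8, the forecaster plays action 2 only at the exploration rounds $t = k^2 + 1$, so its average loss approaches 0.2.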


Theorem 7.10. Regardless of the sequence of outcomes $J_1, J_2, \ldots$ the deterministic 
forecaster defined above satisfies 

\[ \limsup_{n \to \infty} \hat\mu_n \le \min_{i=1,2} \limsup_{n \to \infty} \hat\mu_{i,n}. \]


Proof. For each $t = 1, 2, \ldots$, let $i_t^* = \operatorname{argmin}_{i=1,2} \hat\mu_{i,t}$ and let $t_1, t_2, \ldots$ be the time 
instances such that $i^*_{t_j} \ne i^*_{t_j - 1}$, that is, the “leader” is switched. If there is only a finite 
number of such $t_j$'s, then, obviously, 

\[ \hat\mu_n - \min_{i=1,2} \hat\mu_{i,n} \to 0, \]

which implies the stated inequality. Thus, we may assume that there is an infinite number 
of switches and it suffices to show that whenever $T = \max\{t_j : t_j \le n\}$, then either 

\[ \hat\mu_n - \min_{i=1,2} \hat\mu_{i,T} \le \frac{\text{const.}}{T^{1/4}} \tag{7.5} \]

or 

\[ \hat\mu_n - \min_{i=1,2} \hat\mu_{i,n} \le \frac{\text{const.}}{T^{1/4}}, \tag{7.6} \]

which implies the statement. 
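First observe that, due to the exploration step, for any $t \ge 3$ and $i = 1, 2$, 

\[ \sum_{s=1}^{t} \mathbb{1}_{\{I_s = i\}} \ge \left\lfloor \sqrt{t - 1} \right\rfloor \ge \sqrt{t}/2. \]

But then 

\[ \left| \hat\mu_{1,T} - \hat\mu_{2,T} \right| \le \frac{2}{\sqrt{T}}. \]

This inequality holds because by the boundedness of the loss, at time $T$, the averaged loss 
of action $i$ can change by at most $1 / \sum_{s=1}^{T} \mathbb{1}_{\{I_s = i\}} \le 2/\sqrt{T}$, and the definition of the switch 
is that the one that was larger in the previous step becomes smaller, which is only possible 
if the averaged losses of the two actions were already $2/\sqrt{T}$ close to each other. But then 
the averaged loss of the forecaster at the time $T$ (the last switch before time $n$) may be 
bounded by 

\[
\begin{aligned}
\hat\mu_T &= \frac{1}{T} \left( \sum_{s=1}^{T} \ell(I_s, J_s) \mathbb{1}_{\{I_s = 1\}} + \sum_{s=1}^{T} \ell(I_s, J_s) \mathbb{1}_{\{I_s = 2\}} \right) \\
&= \frac{1}{T} \left( \hat\mu_{1,T} \sum_{s=1}^{T} \mathbb{1}_{\{I_s = 1\}} + \hat\mu_{2,T} \sum_{s=1}^{T} \mathbb{1}_{\{I_s = 2\}} \right) \\
&\le \min_{i=1,2} \hat\mu_{i,T} + \frac{2}{\sqrt{T}}.
\end{aligned}
\]

Now assume that $T$ is so large that $n - T \le T^{3/4}$. Then clearly, $|\hat\mu_n - \hat\mu_T| \le T^{3/4}/n \le 
T^{-1/4}$ and (7.5) holds. 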

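Thus, in the rest of the proof we assume that $n - T \ge T^{3/4}$. It remains to show that $\hat\mu_n$ 
cannot be much larger than $\min_{i=1,2} \hat\mu_{i,n}$. Introduce the notation 

\[ \delta = \hat\mu_n - \min_{i=1,2} \hat\mu_{i,T}. \]

Since 

\[
\begin{aligned}
\hat\mu_n &= \frac{1}{n} \left( \sum_{t=1}^{T} \ell(I_t, J_t) + \sum_{t=T+1}^{n} \ell(I_t, J_t) \right) \\
&\le \frac{1}{n} \left( T \min_{i=1,2} \hat\mu_{i,T} + 2\sqrt{T} + \sum_{t=T+1}^{n} \ell(I_t, J_t) \right),
\end{aligned}
\]

we have 

\[ \sum_{t=T+1}^{n} \ell(I_t, J_t) \ge (n - T) \min_{i=1,2} \hat\mu_{i,T} + \delta n - 2\sqrt{T}. \]

Since, apart from at most $\sqrt{n - T}$ exploration steps, the same action is played between 
times $T + 1$ and $n$, we have 

\[
\begin{aligned}
\min_{i=1,2} \hat\mu_{i,n}
&\ge \frac{\hat\mu_{i_T^*,T} \sum_{s=1}^{T} \mathbb{1}_{\{I_s = i_T^*\}} + \sum_{t=T+1}^{n} \ell(I_t, J_t) - \sqrt{n - T}}{\sum_{s=1}^{T} \mathbb{1}_{\{I_s = i_T^*\}} + (n - T)} \\
&\ge \frac{\hat\mu_{i_T^*,T} \sum_{s=1}^{T} \mathbb{1}_{\{I_s = i_T^*\}} + (n - T) \min_{i=1,2} \hat\mu_{i,T} + \delta n - 2\sqrt{T} - \sqrt{n - T}}{\sum_{s=1}^{T} \mathbb{1}_{\{I_s = i_T^*\}} + (n - T)} \\
&= \frac{\hat\mu_{i_T^*,T} \left( \sum_{s=1}^{T} \mathbb{1}_{\{I_s = i_T^*\}} + (n - T) \right) + \delta n - 2\sqrt{T} - \sqrt{n - T}}{\sum_{s=1}^{T} \mathbb{1}_{\{I_s = i_T^*\}} + (n - T)} \\
&\ge \hat\mu_{i_T^*,T} + \delta - \frac{2\sqrt{T}}{n - T} - \frac{1}{\sqrt{n - T}} \\
&\ge \hat\mu_n - 3\, T^{-1/4},
\end{aligned}
\]

where at the last inequality we used $n - T \ge T^{3/4}$. Thus, (7.6) holds in this case. ∎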

There is one more ingredient we need in order to establish strategies of the desired 
behavior. As we have mentioned before, our aim is to design strategies that perform well 
if the behavior of the opponent is such that the row player can estimate, for each action, 
the average loss suffered by playing that action all the time. In order to do this, we modify 
the forecaster studied above such that whenever an action is chosen, it is played repeatedly 
sufficiently many times in a row so that the forecaster gets a good picture of the behavior 
of the opponent when that action is played. This modification is done trivially by simply 
repeating each action $\tau$ times, where the positive integer $\tau$ is a parameter of the strategy. 


REPEATED DETERMINISTIC EXPLORATION-EXPLOITATION 
Parameter: Number of repetitions $\tau$. 
For each round $t = 1, 2, \ldots$ 
(1) (Exploration) if $t = k^2 \tau + s$ for integers $k$ and $s = 0, 1, \ldots, \tau - 1$, then set 
$I_t = 1$; 
if $t = (k^2 + 1)\tau + s$ for integers $k$ and $s = 0, 1, \ldots, \tau - 1$, then set $I_t = 2$; 


(2) (Exploitation) otherwise, let $I_t = \operatorname{argmin}_{i=1,2} \hat\mu_{i,\tau \lfloor t/\tau \rfloor - 1}$ (in case of a tie, break 
it, say, in favor of action 1). 


Theorem 7.10 (as well as Exercise 7.30) trivially extends to this case and the strategy 
defined above obviously satisfies 

\[ \limsup_{n \to \infty} \hat\mu_n \le \min_{i=1,2} \limsup_{n \to \infty} \hat\mu_{i,n} \]

regardless of the opponent’s actions and the parameter $\tau$. 
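
To see the repeated strategy at work, the sketch below runs it against tit for tat in the prisoners' dilemma. Several conventions here are assumptions of this example and are not fixed by the text: actions 1 and 2 stand for “c” and “d”, the losses are 1/3, 1, 0, and 2/3 for the pairs (c, c), (c, d), (d, c), and (d, d), tit for tat opens by cooperating, the integer $k$ may be 0, and when a block index matches both exploration rules the first rule wins. The function names are hypothetical.

```python
import math

# Row player's loss, indexed by (row action, column action); 1 = "c", 2 = "d".
LOSS = {(1, 1): 1 / 3, (1, 2): 1.0, (2, 1): 0.0, (2, 2): 2 / 3}


def is_square(m: int) -> bool:
    """True if m = k * k for some integer k >= 0."""
    return m >= 0 and math.isqrt(m) ** 2 == m


def repeated_average_loss(n: int, tau: int) -> float:
    """Average loss over n rounds of the repeated strategy against tit for tat."""
    total = {1: 0.0, 2: 0.0}
    count = {1: 0, 2: 0}
    frozen = {1: 1.0, 2: 1.0}  # averaged losses at the end of the previous block
    column = 1                 # tit for tat is assumed to open with "c"
    cumulative = 0.0
    for t in range(1, n + 1):
        block = t // tau
        if t % tau == 0:
            # A new block starts: record the averages up to time t - 1.
            frozen = {i: (total[i] / count[i] if count[i] else 1.0) for i in (1, 2)}
        if is_square(block):
            row = 1
        elif is_square(block - 1):
            row = 2
        else:
            row = 1 if frozen[1] <= frozen[2] else 2
        loss = LOSS[(row, column)]
        total[row] += loss
        count[row] += 1
        cumulative += loss
        column = row  # tit for tat copies the row player's last action
    return cumulative / n


if __name__ == "__main__":
    print(repeated_average_loss(10_000, tau=10))
```

With $\tau = 10$ each block of defection looks worse than cooperation once the first, atypical round of the block is averaged out, so the strategy settles on cooperating outside the sparse exploration blocks.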
Our main assumption on the opponent’s behavior is that, for every action $i$, there exists 
a number $\bar\mu_i \in [0, 1]$ such that for any time instance $t$ and past plays $I_1, \ldots, I_t$, 

\[ \frac{1}{\tau} \sum_{s=t+1}^{t+\tau} \ell(i, J_s) - \bar\mu_i \le \varepsilon_\tau, \]

where $\varepsilon_\tau$ is a sequence of nonnegative numbers converging to 0 as $\tau \to \infty$. (Here the 
average loss is computed by assuming that the row player’s moves are $I_1, \ldots, I_t, i, i, \ldots, i$.) 
After de Farias and Megiddo [86], we call an opponent satisfying this condition flexible. 
Clearly, if the opponent is flexible, then for any action $i$ the average loss of playing the 
action forever is at most $\bar\mu_i$. Moreover, the performance bound for the repeated deterministic 
exploration-exploitation immediately implies the following. 


Corollary 7.3. Assume that the row player plays according to the repeated deterministic 
exploration-exploitation strategy with parameter $\tau$ against a flexible opponent. Then the 
asymptotic average cumulative loss of the row player satisfies 

\[ \limsup_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \ell(I_t, J_t) \le \min_{i=1,\ldots,N} \bar\mu_i + \varepsilon_\tau. \]

The assumption of flexibility is satisfied in many cases when the opponent’s long-term 
behavior against any fixed action can be estimated by playing the action repeatedly for a 
stretch of time of length $\tau$. This is satisfied, for example, when the opponent is modeled 
by a finite automaton. In the example of the opponent playing tit for tat in the prisoners’ 
dilemma described at the beginning of this section, the opponent is clearly flexible with 
$\varepsilon_\tau = 1/\tau$. Note that in these cases one actually has 

\[ \left| \frac{1}{\tau} \sum_{s=t+1}^{t+\tau} \ell(i, J_s) - \bar\mu_i \right| \le \varepsilon_\tau \]

(with $\bar\mu_1 = 1/3$ and $\bar\mu_2 = 2/3$); that is, the estimated average losses are actually close to the 
asymptotic performance of the corresponding action. However, for Corollary 7.3 it suffices 
to require the one-sided inequality. 
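
The flexibility of tit for tat can be verified exhaustively, since the opponent's reply depends only on the row player's previous action. The sketch below uses exact fractions and assumes the loss values 1/3, 1, 0, and 2/3 for the pairs (c, c), (c, d), (d, c), and (d, d); the names `stretch_average` and `worst_deviation` are hypothetical.

```python
from fractions import Fraction

# Row player's loss, indexed by (row action, column action); assumed values.
LOSS = {
    ("c", "c"): Fraction(1, 3),
    ("c", "d"): Fraction(1),
    ("d", "c"): Fraction(0),
    ("d", "d"): Fraction(2, 3),
}
# Long-run average loss of playing each action forever against tit for tat.
MU_BAR = {"c": Fraction(1, 3), "d": Fraction(2, 3)}


def stretch_average(action, previous_row_action, tau):
    """Average loss of playing `action` for tau rounds against tit for tat."""
    column = previous_row_action  # tit for tat replays the row player's last move
    total = Fraction(0)
    for _ in range(tau):
        total += LOSS[(action, column)]
        column = action
    return total / tau


def worst_deviation(tau):
    """Largest gap between a stretch average and the long-run average."""
    return max(abs(stretch_average(a, p, tau) - MU_BAR[a])
               for a in "cd" for p in "cd")


if __name__ == "__main__":
    for tau in (1, 5, 10, 100):
        print(tau, worst_deviation(tau))
```

Under these loss values the worst-case gap is $2/(3\tau)$, which is within the bound $\varepsilon_\tau = 1/\tau$ stated in the text.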

Corollary 7.3 states the existence of a strategy of playing repeated games such that, 
against any flexible opponent, the average loss is at most that of the best action (calculated 
by assuming that the action is played constantly) plus the quantity $\varepsilon_\tau$ that can be made 
arbitrarily small by choosing the parameter $\tau$ of the algorithm sufficiently large. However, 
the sequence $\varepsilon_\tau$ depends on the opponent and may not be known to the forecaster. Thus, 
it is desirable to find a forecaster whose average loss actually achieves $\min_{i=1,\ldots,N} \bar\mu_i$ 
asymptotically. Such a method may now easily be constructed. 


Corollary 7.4. There exists a forecaster such that whenever the opponent is flexible, 


\[
\limsup_{n\to\infty} \frac{1}{n}\sum_{t=1}^{n} \ell(I_t, J_t) \le \min_{i=1,\dots,N} \mu_i.
\]


We leave the details as a routine exercise (Exercise 7.32). 


Remark 7.17 (Randomized opponents). In some cases it may be meaningful to consider 
strategies for the adversary that use randomization. In such cases our definition of flexibility, 
which poses a deterministic condition on the opponent, is not realistic. However, the 
definition may be easily modified to accommodate a possibly randomized behavior. In fact, 
the original definition of de Farias and Megiddo [86] involves a probabilistic assumption. 


7.12 Bibliographic Remarks 


Playing and learning in repeated games is an important branch of game theory with an exten- 
sive literature. In this chapter we addressed only a tiny corner of this immense subject. The 
interested reader may consult the monographs of Fudenberg and Levine [119], Sorin [276], 
and Young [316]. Hart [144] gives an excellent survey of regret-based uncoupled learning 
dynamics. 

von Neumann’s minimax theorem is the classic result of game theory (see von Neumann 
and Morgenstern [296]), and most standard textbooks on game theory provide a proof. 
Various generalizations, including stronger versions of Theorem 7.1, are due to Fan [93] 
and Sion [271] (see also the references therein). The proof of Theorem 7.1 shown here is a
generalization of ideas of Freund and Schapire [114], who prove von Neumann’s minimax
theorem using the strategy described in Exercise 7.9. 

The notion and the proof of existence of Nash equilibria appear in the celebrated
paper of Nash [222]. For the basic results on the Nash convergence of fictitious play, see 
Robinson [246], Miyasawa [218], Shapley [265], Monderer and Shapley [220]. Hofbauer 
and Sandholm [163] consider stochastic fictitious play, similar, in spirit, to the follow- 
the-perturbed-leader forecaster considered in Chapter 4, and prove its convergence for a 
class of games. See the references within [163] for various related results. Singh, Kearns, 
and Mansour [270] show that a simple dynamics based on gradient-descent yields average 
payoffs asymptotically equivalent to those of a Nash equilibrium in the special case of
two-player games in which both players have two actions. 

The notion of correlated equilibrium was first introduced by Aumann [16,17]. A direct 
proof of the existence of correlated equilibria, using just von Neumann’s minimax theorem 
(as opposed to the fixed point theorem needed to prove the existence of Nash equilibria) 
was given by Hart and Schmeidler [150]. The existence of adaptive procedures leading to 
a correlated equilibrium was shown by Foster and Vohra [105]; see also Fudenberg and 
Levine [118, 121] and Hart and Mas-Colell [145, 146]. Stoltz and Lugosi [278] generalize 
this to games with an infinite, but compact, set of actions. The connection of calibration and 
correlated equilibria, described in Section 7.6, was pointed out by Foster and Vohra [105]. 
Kakade and Foster [171] take these ideas further and show that if all players play according 
to a best response to a certain common, “almost deterministic,” well-calibrated forecaster, 
then the joint empirical frequencies of play converge not only to the set of correlated 
equilibria but, in fact, to the convex hull of the set of Nash equilibria. Hart and Mas- 
Colell [145] introduce a strategy, the so-called regret matching, conceptually much simpler 
than the internal regret minimization procedures described in Section 4.4, which has the 
property that if all players follow this strategy, the joint empirical frequencies converge 
to the set of correlated equilibria; see also Cahn [44]. Kakade, Kearns, Langford, and 
Ortiz [172] consider efficient algorithms for computing correlated equilibria in graphical 
games. The result of Section 7.5 appears in Hart and Mas-Colell [147]. 

Blackwell’s approachability theory dates back to [28], where Theorem 7.5 is proved. It 
was also Blackwell [29] who pointed out that the approachability theorem may be used to 
construct Hannan-consistent forecasting strategies. Various generalizations of this theorem 
may be found in Vieille [295] and Lehrer [193]. Fabian and Hannan [92] studied rates of
convergence in an extended setting in which payoffs may be random and not necessarily 
bounded. The potential-based strategies of Section 7.8 were introduced by Hart and Mas- 
Colell [146] and Theorem 7.6 is due to them. In [146] the result is stated under a weaker 
assumption than convexity of the potential function. 

The problem of learning Nash equilibria by uncoupled strategies has been pursued by 
Foster and Young [108, 109]. They introduce the idea of regret testing, which the procedures 
studied in Section 7.10 are based on. Their procedures guarantee that, asymptotically, the 
mixed strategy profiles are within distance $\varepsilon$ of the set of Nash equilibria in a fraction of
at least $1 - \varepsilon$ of time. On the negative side, Hart and Mas-Colell [148, 149] show that it
is impossible to achieve convergence to Nash equilibrium for all games if one is restricted 
to use stationary strategies that have bounded memory. By “bounded memory” they mean 
that there is a finite integer T such that each player bases his play only on the last T rounds 
of play. On the other hand, for every $\varepsilon > 0$ they show a randomized bounded-memory
stationary uncoupled procedure, different from those presented in Sections 7.9 and 7.10,
for which the joint empirical frequencies of play converge almost surely to an $\varepsilon$-Nash
equilibrium. Germano and Lugosi [126] modify the regret testing procedure of Foster and
Young to achieve almost sure convergence to the set of $\varepsilon$-Nash equilibria for all games.
The analysis of Section 7.10 is based on [126]. In particular, the proof of Lemma 7.5 is 
found in [126], though the somewhat simpler case of two players is shown in Foster and 
Young [109]. 

A closely related branch of the literature, not discussed in this chapter, studies
learning rules in which players update their beliefs using Bayes’ rule. Kalai and
Lehrer [176] show that if the priors “contain a grain of truth,” the play converges to a Nash 
equilibrium of the game. See also Jordan [169, 170], Dekel, Fudenberg, and Levine [87], 
Fudenberg and Levine [117,119], and Nachbar [221]. 

Kalai, Lehrer, and Smorodinsky [177] show that this type of learning is closely related 
to stronger notions of calibration and merging. See also Lehrer and Smorodinsky [196] and
Sandroni and Smorodinsky [258].

The material presented in Section 7.11 is based on the work of de Farias and Megiddo [85, 
86], though the analysis shown here is different. In particular, the forecaster of de Farias 
and Megiddo is randomized and conceptually simpler than the deterministic predictor used 
here.


7.13 Exercises


7.1 Show that the set of all Nash equilibria of a two-person zero-sum game is closed and convex.


7.2 (Shapley’s game) Consider the two-person game described by the loss matrices of the two 
players (“R” and “C”), known as Shapley’s game: 


R\C | 1  2  3          R\C | 1  2  3
 1  | 0  1  1           1  | 1  0  1
 2  | 1  0  1           2  | 1  1  0
 3  | 1  1  0           3  | 0  1  1


Show that if both players use fictitious play, the empirical frequencies of play do not converge 
to the set of correlated equilibria (Foster and Vohra [105]). 


7.3 Prove Lemma 7.1. 


7.4 Consider the two-person game given by the losses 


R\C | 1  2          R\C | 1  2
 1  | 1  5           1  | 1  0
 2  | 0  7           2  | 5  7


Find all three Nash equilibria of the game. Show that the distribution given by P(1, 1) = 1/3, 
P(1,2) = 1/3, P(2, 1) = 1/3, P(2, 2) = 0 is a correlated equilibrium that lies outside of the 
convex hull of the Nash equilibria. (Aumann [17]). 


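7.5 Show that a probability distribution $P$ over $\bigotimes_{k=1}^{K} \{1,\dots,N_k\}$ is a correlated equilibrium if and
only if for all $k = 1,\dots,K$,

\[
\mathbb{E}\,\ell^{(k)}(\mathbf{I}) \le \mathbb{E}\,\ell^{(k)}\bigl(\mathbf{I}^-, \widetilde{I}^{(k)}\bigr),
\]

where $\mathbf{I} = (I^{(1)},\dots,I^{(K)})$ is distributed according to $P$ and the random variable $\widetilde{I}^{(k)}$ is any
function of $I^{(k)}$ and of a random variable $U$ independent of $\mathbf{I}$.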

7.6 Consider the repeated time-varying game described in Remark 7.3, with N = M = 2. Assume 
that there exist positive numbers $\varepsilon, \delta$ such that, for every sufficiently large $n$, at least for $n\delta$ time
steps $t$ between time 1 and $n$,

\[
\max_{j=1,2} \bigl|\ell_t(1, j) - \ell_t(2, j)\bigr| > \varepsilon.
\]

Show that then for any sequence of mixed strategies $\mathbf{p}_1, \mathbf{p}_2, \dots$ of the row player, the column
player can choose his mixed strategies $\mathbf{q}_1, \mathbf{q}_2, \dots$ such that the row player’s cumulative loss
satisfies

\[
\sum_{t=1}^{n} \ell_t(I_t, J_t) - \sum_{t=1}^{n} \min_{i=1,2} \ell_t(i, J_t) > \gamma n
\]

for all sufficiently large $n$ with probability 1, where $\gamma$ is positive.


7.7 Assume that in a two-person zero-sum game, for all $t$, the row player plays according to the
constant mixed strategy $\mathbf{p}_t = \mathbf{p}$, where $\mathbf{p}$ is any mixed strategy for which there exists a mixed
strategy $\mathbf{q}$ of the column player such that $\overline{\ell}(\mathbf{p}, \mathbf{q}) = V$. Show that

\[
\limsup_{n\to\infty} \frac{1}{n}\sum_{t=1}^{n} \ell(I_t, J_t) \le V.
\]

Show also that, for any $\varepsilon > 0$, the row player, regardless of how he plays, cannot guarantee that

\[
\limsup_{n\to\infty} \frac{1}{n}\sum_{t=1}^{n} \ell(I_t, J_t) \le V - \varepsilon.
\]


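7.8 Consider a two-person zero-sum game and assume that the row player plays according to the
exponentially weighted average mixed strategy

\[
p_{i,t} = \frac{\exp\bigl(-\eta \sum_{s=1}^{t-1} \ell(i, J_s)\bigr)}{\sum_{k=1}^{N} \exp\bigl(-\eta \sum_{s=1}^{t-1} \ell(k, J_s)\bigr)}, \qquad i = 1,\dots,N.
\]

Show that, with probability at least $1 - \delta$, the average loss of the row player satisfies

\[
\frac{1}{n}\sum_{t=1}^{n} \ell(I_t, J_t) \le V + \frac{\ln N}{n\eta} + \frac{\eta}{8} + \sqrt{\frac{2}{n}\ln\frac{2N}{\delta}}.
\]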


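7.9 Freund and Schapire [113] investigate the weighted average forecaster in the simplified version
of the setup of Section 7.3, in which the row player gets to see the distribution $\mathbf{q}_{t-1}$ chosen by
the column player before making the play at time $t$. Then the following version of the weighted
average strategy for the row player is feasible:

\[
p_{i,t} = \frac{\exp\bigl(-\eta \sum_{s=1}^{t-1} \overline{\ell}(i, \mathbf{q}_s)\bigr)}{\sum_{k=1}^{N} \exp\bigl(-\eta \sum_{s=1}^{t-1} \overline{\ell}(k, \mathbf{q}_s)\bigr)}, \qquad i = 1,\dots,N,
\]

with $p_{i,1}$ set to $1/N$, where $\eta > 0$ is an appropriately chosen constant. Show that this strategy
is an instance of the weighted average forecaster (see Section 4.2), which implies that

\[
\sum_{t=1}^{n} \overline{\ell}(\mathbf{p}_t, \mathbf{q}_t) \le \min_{\mathbf{p}} \sum_{t=1}^{n} \overline{\ell}(\mathbf{p}, \mathbf{q}_t) + \frac{\ln N}{\eta} + \frac{n\eta}{8},
\]

where

\[
\overline{\ell}(\mathbf{p}_t, \mathbf{q}_t) = \sum_{i=1}^{N} \sum_{j=1}^{M} p_{i,t}\, q_{j,t}\, \ell(i, j).
\]

Show that if $I_t$ denotes the actual randomized play of the row player, then with an appropriately
chosen $\eta = \eta_t$,

\[
\lim_{n\to\infty} \frac{1}{n}\left(\sum_{t=1}^{n} \overline{\ell}(I_t, \mathbf{q}_t) - \min_{\mathbf{p}} \sum_{t=1}^{n} \overline{\ell}(\mathbf{p}, \mathbf{q}_t)\right) = 0 \quad \text{almost surely}
\]

(Freund and Schapire [113]).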


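7.10 Improve the bound of the previous exercise to

\[
\sum_{t=1}^{n} \overline{\ell}(\mathbf{p}_t, \mathbf{q}_t) \le \min_{\mathbf{p}} \left(\sum_{t=1}^{n} \overline{\ell}(\mathbf{p}, \mathbf{q}_t) - \frac{H(\mathbf{p})}{\eta}\right) + \frac{\ln N}{\eta} + \frac{n\eta}{8},
\]

where $H(\mathbf{p}) = -\sum_{i=1}^{N} p_i \ln p_i$ denotes the entropy of the probability vector $\mathbf{p} = (p_1,\dots,p_N)$.
Hint: Improve the crude bound $\frac{1}{\eta}\ln\bigl(\sum_i e^{\eta R_{i,n}}\bigr) \ge \max_i R_{i,n}$ to
$\frac{1}{\eta}\ln\bigl(\sum_i e^{\eta R_{i,n}}\bigr) \ge \max_{\mathbf{p}} \bigl(\mathbf{R}_n \cdot \mathbf{p} + H(\mathbf{p})/\eta\bigr)$.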

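7.11 Consider repeated play in a two-person zero-sum game in which both players play such that

\[
\lim_{n\to\infty} \frac{1}{n}\sum_{t=1}^{n} \ell(I_t, J_t) = V \quad \text{almost surely.}
\]

Show that the product distribution $\widehat{\mathbf{p}}_n \times \widehat{\mathbf{q}}_n$, with

\[
\widehat{p}_{i,n} = \frac{1}{n}\sum_{t=1}^{n} \mathbb{I}_{\{I_t = i\}} \quad \text{and} \quad \widehat{q}_{j,n} = \frac{1}{n}\sum_{t=1}^{n} \mathbb{I}_{\{J_t = j\}},
\]

converges, almost surely, to the set of Nash equilibria. Hint: Check the proof of Theorem 7.2.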


7.12 Robinson [246] showed that if in repeated playing of a two-person zero-sum game both players
play according to fictitious play (i.e., choose the best pure strategy against the average
mixed strategy of their opponent), then the product of the marginal empirical frequencies 
of play converges to a solution of the game. Show, however, that fictitious play does not 
have the following robustness property similar to the exponentially weighted average strat- 
egy deduced in Theorem 7.2: if a player uses fictitious play but his opponent does not, 
then the player’s normalized cumulative loss may be significantly larger than the value of 
the game. 


7.13 (Fictitious conditional regret minimization) Consider a two-person game in which the loss
matrix of both players is given by

R\C | 1  2
 1  | 0  1
 2  | 1  0

Show that if both players play according to fictitious play (breaking ties randomly if necessary),
then Nash equilibrium is achieved in a strong sense.

Consider now the “conditional” (or “internal”) version of fictitious play in which both players
$k = 1, 2$ select

\[
I_t^{(k)} = \mathop{\mathrm{argmin}}_{i_k \in \{1,2\}} \frac{1}{t-1} \sum_{s=1}^{t-1} \mathbb{I}_{\{I_s^{(k)} = I_{t-1}^{(k)}\}}\, \ell^{(k)}\bigl(\mathbf{I}_s^-, i_k\bigr).
\]

Show that if the play starts with, say, (1, 2), then both players will have maximal loss in every
round of the game.

7.14 Show that if in a repeated play of a $K$-person game all players play according to some
Hannan-consistent strategy, then the joint empirical frequencies of play converge to the Hannan set of
the game.

7.15 Consider the two-person zero-sum game given by the loss matrix

     0    0   -1
     0    0    1
     1   -1    0

Show that the joint distribution

    1/3  1/3   0
    1/3   0    0
     0    0    0

is a correlated equilibrium of the game. This example shows that even in zero-sum games the
set of correlated equilibria may be strictly larger than the set of Nash equilibria (Forges [101]).

7.16 Describe a game for which $\mathcal{H} \setminus \mathcal{C} \neq \emptyset$, that is, the Hannan set contains some distributions that
are not correlated equilibria.

7.17 Show that if $P \in \mathcal{H}$ is a product measure, then $P \in \mathcal{N}$. In other words, the product measures
in the Hannan set are precisely the Nash equilibria.

7.18 Show that in a $K$-person game with $N_k = 2$ for all $k = 1,\dots,K$ (i.e., each player has two
actions to choose from), $\mathcal{H} = \mathcal{C}$.

7.19 Extend the procedure and the proof of Theorem 7.4 to the general case of $K$-person games.

7.20 Construct a game with two-dimensional vector-valued losses and a (nonconvex) set $S \subset \mathbb{R}^2$
such that all halfspaces containing $S$ are approachable but $S$ is not.

7.21 Construct a game with vector-valued losses and a closed and convex polytope such that if the
polytope is written as a finite intersection of closed halfspaces, where the hyperplanes defining 
the halfspaces correspond to the faces of the polytope, then all these closed halfspaces are 
approachable but the polytope is not. 

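7.22 Use Theorem 7.5 to show that, in the setup of Section 7.4, each player has a strategy such that
the limsup of the conditional regrets is nonpositive regardless of the other players’ actions.

7.23 This exercise presents a strategy that achieves a significantly faster rate of convergence in
Blackwell’s approachability theorem than that obtained in the proof of the theorem in the
text. Let $S$ be a closed and convex set, and assume that all halfspaces $H$ containing $S$ are
approachable. Define $\overline{\mathbf{A}}_0 = 0$ and $\overline{\mathbf{A}}_t = \frac{1}{t}\sum_{s=1}^{t} \overline{\ell}(\mathbf{p}_s, J_s)$ for $t \ge 1$. Define the row player’s
mixed strategy $\mathbf{p}_t$ at time $t = 1, 2, \dots$ as arbitrary if $\overline{\mathbf{A}}_{t-1} \in S$ and by

\[
\max_{1 \le j \le M} \mathbf{a}_{t-1} \cdot \overline{\ell}(\mathbf{p}_t, j) \le c_{t-1}
\]

otherwise, where

\[
\mathbf{a}_{t-1} = \frac{\overline{\mathbf{A}}_{t-1} - \pi_S(\overline{\mathbf{A}}_{t-1})}{\bigl\|\overline{\mathbf{A}}_{t-1} - \pi_S(\overline{\mathbf{A}}_{t-1})\bigr\|} \quad \text{and} \quad c_{t-1} = \mathbf{a}_{t-1} \cdot \pi_S(\overline{\mathbf{A}}_{t-1}).
\]

Prove that there exists a universal constant $C$ (independent of $n$ and $d$) such that, with probability
at least $1 - \delta$,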


\[
\bigl\|\mathbf{A}_n - \pi_S(\mathbf{A}_n)\bigr\| \le \frac{2}{\sqrt{n}} + C\sqrt{\frac{\ln(1/\delta)}{n}}.
\]


Hint: Proceed as in the proof of Theorem 7.5 to show that $\|\overline{\mathbf{A}}_n - \pi_S(\overline{\mathbf{A}}_n)\| \le 2/\sqrt{n}$. To
obtain a dimension-free constant when bounding $\|\mathbf{A}_n - \overline{\mathbf{A}}_n\|$, you will need an extension of
the Hoeffding-Azuma inequality to vector-valued martingales; see, for example, Chen and
White [58].

7.24 Consider the potential-based strategy, based on the average loss $\mathbf{A}_{t-1}$, described at the beginning
of Section 7.8. Show that, under the same conditions on $S$ and $\Phi$ as in Theorem 7.6, the average
loss satisfies $\lim_{n\to\infty} d(\mathbf{A}_n, S) = 0$ with probability 1. Hint: Mimic the proof of Theorem 7.6.


7.25 (A stationary strategy to find pure Nash equilibria) Assume that a $K$-person game has a pure
action Nash equilibrium and consider the following strategy for player $k$: If $t = 1, 2$, choose
$I_t^{(k)}$ randomly. If $t > 2$, if all players have played the same action in the last two periods (i.e.,
$\mathbf{I}_{t-1} = \mathbf{I}_{t-2}$) and $I_{t-1}^{(k)}$ was a best response to $\mathbf{I}_{t-1}^-$, then repeat the same play, that is, define
$I_t^{(k)} = I_{t-1}^{(k)}$. Otherwise, choose $I_t^{(k)}$ uniformly at random.

Prove that if all players play according to this strategy, then a pure action Nash equilibrium
is eventually achieved, almost surely. (Hart and Mas-Colell [149].)

7.26 (Generic two-player game with a pure Nash equilibrium) Consider a two-player game with a
pure action Nash equilibrium. Assume also that the game is generic in the sense that the best
reply is always unique. Suppose at time $t$ each player repeats the play of time $t - 1$ if it was a best
response and selects an action randomly otherwise. Prove that a pure action Nash equilibrium
is eventually achieved, almost surely [149]. Hint: The process $\mathbf{I}_1, \mathbf{I}_2, \dots$ is a Markov chain
with state space $\{1,\dots,N_1\} \times \{1,\dots,N_2\}$. Show that given any state $\mathbf{i} = (i_1, i_2)$, which is not
a Nash equilibrium, the two-step transition probability satisfies

\[
\mathbb{P}\bigl[\mathbf{I}_t \text{ is a Nash equilibrium} \mid \mathbf{I}_{t-2} = (i_1, i_2)\bigr] \ge c
\]

for a constant $c > 0$.


7.27 (A nongeneric game) Consider a two-player game (played by “R” and “C”) whose loss matrices
are given by

R\C | 1  2  3          R\C | 1  2  3
 1  | 0  1  0           1  | 1  0  1
 2  | 1  0  0           2  | 0  1  1
 3  | 1  1  0           3  | 0  0  0


Suppose both players play according to the strategy described in Exercise 7.26. Show that there 
is a positive probability that the unique pure Nash equilibrium is never achieved. (This example 
appears in Hart and Mas-Colell [149].) 


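7.28 Show that almost all games (with respect to the Lebesgue measure) are such that there exist
constants $c_1, c_2 > 0$ such that for all sufficiently small $\varepsilon > 0$, the set $\mathcal{N}_\varepsilon$ of approximate Nash
equilibria satisfies

\[
D_\infty(\mathcal{N}, c_1\varepsilon) \subset \mathcal{N}_\varepsilon \subset D_\infty(\mathcal{N}, c_2\varepsilon),
\]

where $D_\infty(\mathcal{N}, \varepsilon) = \{\pi \in \Sigma : \|\pi - \pi'\|_\infty \le \varepsilon,\ \pi' \in \mathcal{N}\}$ is the $L_\infty$ neighborhood of the set of
Nash equilibria, of radius $\varepsilon$. (See, e.g., Germano and Lugosi [126].)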

7.29 Use the procedure of experimental regret testing as a building block to design an uncoupled
strategy such that if all players follow the strategy, the mixed strategy profiles converge almost
surely to a Nash equilibrium of the game for almost all games. Hint: Follow the ideas described
in Remark 7.14 and use the Borel-Cantelli lemma. (Germano and Lugosi [126].)


7.30 Extend the forecasting strategy defined in Section 7.11 to the case of $N > 2$ actions such that,
regardless of the sequence of outcomes, 
lim sup, < min limsup kin- 
noo i=l,- n>oo 
Hint: Place the $N$ actions in the leaves of a rooted binary tree and use the original algorithm
recursively in every internal node of the tree. The strategy assigned to the root is the desired
forecasting strategy.

7.31 Assume that both players follow the deterministic exploration-exploitation strategy while playing
the prisoners’ dilemma. Show that the players will end up cooperating. However, if their
play is not synchronized (e.g., if the column player starts following the strategy at time t = 3), 
both players will defect most of the time. 

7.32 Prove Corollary 7.4. Hint: Modify the repeated deterministic exploration-exploitation forecaster
properly either by letting the parameter $T$ grow with time or by using an appropriate doubling
trick.


Absolute Loss


8.1 Simulatable Experts 


In this chapter we take a closer look at the sequential prediction problem of Chapter 2 in the
special case when the outcome space is $\mathcal{Y} = \{0, 1\}$, the decision space is $\mathcal{D} = [0, 1]$, and
the loss function is the “absolute loss” $\ell(\widehat{p}, y) = |\widehat{p} - y|$. We have already encountered this
loss function in Chapter 3, where it was shown that, for general experts, the absolute loss is
in some sense the “hardest” among all bounded convex losses. We now turn our attention to
a different problem: the characterization of the minimax regret $V_n(\mathcal{F})$ for the absolute loss
and for a given class $\mathcal{F}$ of simulatable experts (recall the definition of simulatable experts
from Section 2.9).

In the entire chapter, an expert $f$ means a sequence $f_1, f_2, \dots$ of functions
$f_t : \{0, 1\}^{t-1} \to [0, 1]$, mapping sequences of past outcomes $y^{t-1}$ into elements of the decision
space. We use $\mathcal{F}$ to denote a class of (simulatable) experts $f$. Recall from Section 2.10
that a forecasting strategy $P$ based on a class of simulatable experts is a
sequence $\widehat{p}_1, \widehat{p}_2, \dots$ of functions $\widehat{p}_t : \mathcal{Y}^{t-1} \to \mathcal{D}$ (to simplify notation, we often write $\widehat{p}_t$
instead of $\widehat{p}_t(y^{t-1})$). Recall also that the minimax regret $V_n(\mathcal{F})$ is defined for the absolute
loss by

\[
V_n(\mathcal{F}) = \inf_{P} \sup_{y^n \in \{0,1\}^n} \left( \widehat{L}(y^n) - \inf_{f \in \mathcal{F}} L_f(y^n) \right),
\]

where $\widehat{L}(y^n) = \sum_{t=1}^{n} |\widehat{p}_t(y^{t-1}) - y_t|$ is the cumulative absolute loss of the forecaster $P$
and $L_f(y^n) = \sum_{t=1}^{n} |f_t(y^{t-1}) - y_t|$ denotes the cumulative absolute loss of expert $f$. The
infimum is taken over all forecasters $P$. (In this chapter, and similarly in Chapter 9, we
find it convenient to make the dependence of the cumulative loss on the outcome sequence
explicit; this is why we write $\widehat{L}(y^n)$ for $\widehat{L}_n$.)

Clearly, if the cardinality of $\mathcal{F}$ is $|\mathcal{F}| = N$, then $V_n(\mathcal{F}) \le V_n^{(N)}$, where $V_n^{(N)}$ is the
minimax regret with $N$ general experts defined in Section 2.10. As seen in Section 2.2, for all
$n$ and $N$, $V_n^{(N)} \le \sqrt{(n/2)\ln N}$. Also, by the results of Section 3.7,
$\sup_{n,N} V_n^{(N)}/\sqrt{n \ln N} \ge 1/\sqrt{2}$.

On the other hand, the behavior of $V_n(\mathcal{F})$ is significantly more complex as it depends
on the structure of the class $\mathcal{F}$. To understand the phenomenon, just consider a class of
$N = 2$ experts that always predict the same except for $t = n$, when one of the experts
predicts 0 and the other one predicts 1. In this case, since the experts are simulatable,
the forecaster may simply predict as both experts if $t < n$ and set $\widehat{p}_n = 1/2$. It is
easy to see that this forecaster is minimax optimal, and therefore $V_n(\mathcal{F}) = 1/2$, which is
significantly smaller than the worst-case bound $\sqrt{(n/2)\ln 2}$. In general, intuitively, $V_n(\mathcal{F})$
is small if the experts are “close” to each other in some sense and large if the experts are
“spread out.”

The primary goal of this chapter is to investigate what geometrical properties of $\mathcal{F}$
determine the size of $V_n(\mathcal{F})$. In Section 8.2 we describe a forecaster that is optimal in the
minimax sense, that is, it achieves a worst-case regret equal to $V_n(\mathcal{F})$. The minimax optimal
forecaster also suggests a way of calculating $V_n(\mathcal{F})$ for any given class of experts, and this
calculation becomes especially simple in the case of static experts (see Section 2.9 for
the definition of a static expert). In Section 8.3 the minimax regret $V_n(\mathcal{F})$ is characterized
for static experts in terms of a so-called Rademacher average. Section 8.4 describes
possibly the simplest nontrivial example that illustrates the use of this characterization. In
Section 8.5 we derive general upper and lower bounds for classes of static experts in terms
of the geometric structure of the class $\mathcal{F}$. Section 8.6 is devoted to general (not necessarily
static) classes of simulatable experts. This case is somewhat more difficult to handle as 
there is no elegant characterization of the minimax regret. Nevertheless, using simple 
structural properties, we are able to derive matching upper and lower bounds for some 
interesting classes of experts, such as the class of linear forecasters or the class of Markov 
forecasters. 


8.2 Optimal Algorithm for Simulatable Experts 


The purpose of this section is to present, in the case of simulatable experts, a forecaster that
is optimal in the sense that it minimizes, among all forecasters, the worst-case regret

\[
\sup_{y^n \in \{0,1\}^n} \left( \widehat{L}(y^n) - \inf_{f \in \mathcal{F}} L_f(y^n) \right),
\]

that is,

\[
\sup_{y^n \in \{0,1\}^n} \left( \widehat{L}(y^n) - \inf_{f \in \mathcal{F}} L_f(y^n) \right) = V_n(\mathcal{F}),
\]

where all losses (we recall it once more) are measured using the absolute loss.

Before describing the optimal forecaster, we note that, since the experts are simulatable,
the forecaster may calculate the loss of each expert for any particular outcome sequence.
In particular, for all $y^n \in \{0, 1\}^n$, the forecaster may compute $\inf_{f \in \mathcal{F}} L_f(y^n)$.

We determine the optimal forecaster “backwards,” starting with $\widehat{p}_n$, the prediction
at time $n$. Assume that the first $n - 1$ outcomes $y^{n-1}$ have been revealed and we want
to determine, optimally, the prediction $\widehat{p}_n = \widehat{p}_n(y^{n-1})$. Since our goal is to minimize the
worst-case regret, we need to determine $\widehat{p}_n$ to minimize

\[
\max\left\{ \widehat{L}(y^{n-1}) + \ell(\widehat{p}_n, 0) - \inf_{f \in \mathcal{F}} L_f(y^{n-1}0),\;
\widehat{L}(y^{n-1}) + \ell(\widehat{p}_n, 1) - \inf_{f \in \mathcal{F}} L_f(y^{n-1}1) \right\},
\]

where $y^{n-1}0$ denotes the string of $n$ bits whose first $n - 1$ bits are $y^{n-1}$ and the last bit
is 0. Minimizing this quantity is equivalent to minimizing

\[
\max\left\{ \widehat{p}_n - \inf_{f \in \mathcal{F}} L_f(y^{n-1}0),\; 1 - \widehat{p}_n - \inf_{f \in \mathcal{F}} L_f(y^{n-1}1) \right\}.
\]

Clearly, if we write $A_n(y^n) = -\inf_{f \in \mathcal{F}} L_f(y^n)$, then this is achieved by

\[
\widehat{p}_n =
\begin{cases}
0 & \text{if } A_n(y^{n-1}0) > A_n(y^{n-1}1) + 1, \\
1 & \text{if } A_n(y^{n-1}0) + 1 < A_n(y^{n-1}1), \\
\dfrac{A_n(y^{n-1}1) - A_n(y^{n-1}0) + 1}{2} & \text{otherwise.}
\end{cases}
\]
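
To see why the minimizer takes this three-case form, a short worked derivation may help. The shorthand $a_0$ and $a_1$ for the two values of $A_n$ is introduced here only for brevity and is not part of the original notation.

```latex
% Shorthand (not in the original): a_0 = A_n(y^{n-1}0), a_1 = A_n(y^{n-1}1).
\begin{align*}
g(p) &= \max\{\, p + a_0,\; 1 - p + a_1 \,\}, \qquad p \in [0,1], \\
p + a_0 = 1 - p + a_1 \;&\Longleftrightarrow\; p^\ast = \frac{a_1 - a_0 + 1}{2}, \\
a_0 > a_1 + 1 \;&\Longrightarrow\; p^\ast < 0, \quad g(p) = p + a_0 \text{ on } [0,1], \quad \text{minimized at } p = 0, \\
a_0 + 1 < a_1 \;&\Longrightarrow\; p^\ast > 1, \quad g(p) = 1 - p + a_1 \text{ on } [0,1], \quad \text{minimized at } p = 1, \\
|a_0 - a_1| \le 1 \;&\Longrightarrow\; p^\ast \in [0,1], \quad \min_{p} g(p) = g(p^\ast) = \frac{a_0 + a_1 + 1}{2}.
\end{align*}
```

The function $g$ is the maximum of an increasing and a decreasing line, so its minimum over $[0,1]$ sits at their crossing point when that point lies in the interval, and at the nearer endpoint otherwise.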


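A crucial observation is that this expression of $\widehat{p}_n$ does not depend on the previous predictions
$\widehat{p}_1, \dots, \widehat{p}_{n-1}$. Define

\[
A_{n-1}(y^{n-1}) \stackrel{\mathrm{def}}{=} \min_{\widehat{p}_n \in [0,1]} \max\left\{ \widehat{p}_n - \inf_{f \in \mathcal{F}} L_f(y^{n-1}0),\; 1 - \widehat{p}_n - \inf_{f \in \mathcal{F}} L_f(y^{n-1}1) \right\}.
\]

This may be rewritten as

\[
A_{n-1}(y^{n-1}) = \min_{\widehat{p}_n \in [0,1]} \max\left\{ \widehat{p}_n + A_n(y^{n-1}0),\; 1 - \widehat{p}_n + A_n(y^{n-1}1) \right\}.
\]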

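So far we have calculated the optimal prediction $\widehat{p}_n$ at the last time instance. Next we determine
optimally $\widehat{p}_{n-1}$, assuming that at time $n$ the optimal prediction is used. Determining
$\widehat{p}_{n-1}$ is clearly equivalent to minimizing

\[
\max\left\{ \widehat{L}(y^{n-2}) + \ell(\widehat{p}_{n-1}, 0) + A_{n-1}(y^{n-2}0),\;
\widehat{L}(y^{n-2}) + \ell(\widehat{p}_{n-1}, 1) + A_{n-1}(y^{n-2}1) \right\}
\]

or, equivalently, to minimizing

\[
\max\left\{ \widehat{p}_{n-1} + A_{n-1}(y^{n-2}0),\; 1 - \widehat{p}_{n-1} + A_{n-1}(y^{n-2}1) \right\}.
\]

The solution is, as before,

\[
\widehat{p}_{n-1} =
\begin{cases}
0 & \text{if } A_{n-1}(y^{n-2}0) > A_{n-1}(y^{n-2}1) + 1, \\
1 & \text{if } A_{n-1}(y^{n-2}0) + 1 < A_{n-1}(y^{n-2}1), \\
\dfrac{A_{n-1}(y^{n-2}1) - A_{n-1}(y^{n-2}0) + 1}{2} & \text{otherwise.}
\end{cases}
\]

The procedure may be continued in the same way until we determine $\widehat{p}_1$.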
Formally, given a class $\mathcal{F}$ of experts and a positive integer $n$, a forecaster whose worst-case
regret equals $V_n(\mathcal{F})$ is determined by the following recursion.

MINIMAX OPTIMAL FORECASTER
FOR THE ABSOLUTE LOSS

Parameters: Class $\mathcal{F}$ of simulatable experts.

1. (Initialization) $A_n(y^n) = -\inf_{f \in \mathcal{F}} L_f(y^n)$.
2. (Recurrence) For $t = n, n-1, \dots, 1$,

\[
A_{t-1}(y^{t-1}) = \min_{p \in [0,1]} \max\left\{ p + A_t(y^{t-1}0),\; 1 - p + A_t(y^{t-1}1) \right\}
\]

and

\[
\widehat{p}_t =
\begin{cases}
0 & \text{if } A_t(y^{t-1}0) > A_t(y^{t-1}1) + 1, \\
1 & \text{if } A_t(y^{t-1}0) + 1 < A_t(y^{t-1}1), \\
\dfrac{A_t(y^{t-1}1) - A_t(y^{t-1}0) + 1}{2} & \text{otherwise.}
\end{cases}
\]

Note that the recurrence for $A_t$ may also be written as

\[
A_{t-1}(y^{t-1}) =
\begin{cases}
A_t(y^{t-1}0) & \text{if } A_t(y^{t-1}0) > A_t(y^{t-1}1) + 1, \\
A_t(y^{t-1}1) & \text{if } A_t(y^{t-1}0) + 1 < A_t(y^{t-1}1), \\
\dfrac{A_t(y^{t-1}1) + A_t(y^{t-1}0) + 1}{2} & \text{otherwise.}
\end{cases}
\tag{8.1}
\]


The algorithm for calculating the optimal forecaster has an important by-product: the value
$A_0$ of the quantity $A_{t-1}(y^{t-1})$ at the last step ($t = 1$) of the recurrence clearly gives

\[
A_0 = \max_{y^n} \left[ \sum_{t=1}^{n} \ell(\widehat{p}_t, y_t) + A_n(y^n) \right] = V_n(\mathcal{F}).
\]

Thus, the same algorithm also calculates the minimal worst-case regret. In the next section 
we will see some useful consequences of this fact. 
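
The backward recursion can be sketched in a few lines of code. This is a minimal illustration for a finite class of experts, not an efficient implementation: it enumerates all $2^n$ outcome sequences, and the function names and the representation of an expert as a Python callable on the tuple of past outcomes are assumptions made here.

```python
from itertools import product


def cumulative_loss(expert, y):
    """Cumulative absolute loss of a simulatable expert on the bit sequence y."""
    return sum(abs(expert(y[:t]) - y[t]) for t in range(len(y)))


def minimax_forecaster(experts, n):
    """Backward recursion; returns (A_0, predictions).

    `predictions` maps each prefix y^{t-1} (a tuple of bits) to the
    prediction the optimal forecaster makes after seeing that prefix.
    """
    # Initialization: A_n(y^n) = -min_f L_f(y^n).
    A = {y: -min(cumulative_loss(f, y) for f in experts)
         for y in product((0, 1), repeat=n)}
    predictions = {}
    for t in range(n, 0, -1):
        A_prev = {}
        for prefix in product((0, 1), repeat=t - 1):
            a0, a1 = A[prefix + (0,)], A[prefix + (1,)]
            if a0 > a1 + 1:
                p = 0.0
            elif a0 + 1 < a1:
                p = 1.0
            else:
                p = (a1 - a0 + 1) / 2
            predictions[prefix] = p
            A_prev[prefix] = max(p + a0, 1 - p + a1)
        A = A_prev
    return A[()], predictions


# Two experts that agree (predicting 1/2) until the last round, n = 4.
n = 4
expert_a = lambda past: 0.5 if len(past) < n - 1 else 0.0
expert_b = lambda past: 0.5 if len(past) < n - 1 else 1.0
value, predictions = minimax_forecaster([expert_a, expert_b], n)
print(value)                   # minimax regret of this two-expert class
print(predictions[(0, 1, 1)])  # prediction in the last round
```

The usage example mirrors the two-expert class from Section 8.1, where the experts differ only at time $n$.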


8.3 Static Experts 


In this section we focus our attention on static experts. Recall that an expert $f$ is called
static if for all $t = 1, 2, \dots$ and $y^{t-1} \in \{0, 1\}^{t-1}$, $f_t(y^{t-1}) = f_t \in [0, 1]$. In other words,
static experts’ predictions do not depend on the past outcomes: they are fixed in advance.
For example, the expert that always predicts 0 regardless of the past outcomes is static, but 
the expert whose prediction is the average of all previously seen outcomes is not static. 

The following simple technical result has some surprising consequences. The simple 
inductive proof is left as an exercise. 


Lemma 8.1. Let $\mathcal{F}$ be an arbitrary class of static experts. Then for all $t = 1, \dots, n$ and
$y^{t-1} \in \{0, 1\}^{t-1}$,

\[
\bigl|A_t(y^{t-1}1) - A_t(y^{t-1}0)\bigr| \le 1.
\]


This lemma implies the following result.

Theorem 8.1. If $\mathcal{F}$ is a class of static experts, then

\[
V_n(\mathcal{F}) = \frac{n}{2} - \frac{1}{2^n} \sum_{y^n \in \{0,1\}^n} \inf_{f \in \mathcal{F}} L_f(y^n).
\]

Proof. Lemma 8.1 and (8.1) imply that for all $t = 1, \dots, n$,

\[
A_{t-1}(y^{t-1}) = \frac{A_t(y^{t-1}1) + A_t(y^{t-1}0) + 1}{2}.
\]

Applying this equation recursively, we obtain

\[
A_0 = \frac{1}{2^n} \sum_{y^n \in \{0,1\}^n} A_n(y^n) + \frac{n}{2}.
\]

Recalling that $V_n(\mathcal{F}) = A_0$ and $A_n(y^n) = -\inf_{f \in \mathcal{F}} L_f(y^n)$, we conclude the proof. □
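
For a small finite class, the closed form of Theorem 8.1 can be evaluated directly. The following sketch assumes each static expert is given as a length-$n$ tuple of predictions; the function name is hypothetical, and the brute-force sum over all $2^n$ sequences is only practical for small $n$.

```python
from fractions import Fraction
from itertools import product


def static_minimax_regret(experts, n):
    """V_n(F) = n/2 - 2^{-n} * sum over y^n of min_f L_f(y^n), for static experts.

    Each expert is a length-n sequence of predictions in [0, 1].
    """
    total = Fraction(0)
    for y in product((0, 1), repeat=n):
        total += min(sum(abs(Fraction(f_t) - y_t) for f_t, y_t in zip(f, y))
                     for f in experts)
    return Fraction(n, 2) - total / 2 ** n


n = 3
always_zero = (0,) * n
always_one = (1,) * n
print(static_minimax_regret([always_zero, always_one], n))  # two constant experts
print(static_minimax_regret([always_zero], n))              # a single expert
```

With a single expert the regret is zero, as expected, since the forecaster can simply copy that expert.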


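To understand better the behavior of the value $V_n(\mathcal{F})$, it is advantageous to reformulate
the obtained expression. Recall that, because experts are static, each expert $f$ is represented
by a vector $(f_1, \dots, f_n) \in [0, 1]^n$, where $f_t$ is the prediction of expert $f$ at time $t$. Since
$\ell(f_t, y) = |f_t - y|$, $L_f(y^n) = \sum_{t=1}^{n} |f_t - y_t|$. Also, the average over all possible outcome
sequences appearing in the expression of $V_n(\mathcal{F})$ may be treated as an expected value. To
this end, introduce i.i.d. symmetric Bernoulli random variables $Y_1, \dots, Y_n$ (i.e.,
$\mathbb{P}[Y_t = 0] = \mathbb{P}[Y_t = 1] = 1/2$). Then, by Theorem 8.1,

\begin{align*}
V_n(\mathcal{F}) &= \frac{n}{2} - \mathbb{E}\left[\inf_{f \in \mathcal{F}} \sum_{t=1}^{n} |f_t - Y_t|\right] \\
&= \mathbb{E}\left[\sup_{f \in \mathcal{F}} \sum_{t=1}^{n} \left(\frac{1}{2} - |f_t - Y_t|\right)\right] \\
&= \mathbb{E}\left[\sup_{f \in \mathcal{F}} \sum_{t=1}^{n} \left(\frac{1}{2} - f_t\right)\sigma_t\right], \tag{8.2}
\end{align*}

where $\sigma_t = 1 - 2Y_t$ are i.i.d. Rademacher random variables (i.e., with
$\mathbb{P}[\sigma_t = 1] = \mathbb{P}[\sigma_t = -1] = 1/2$). Thus, $V_n(\mathcal{F})$ equals $n$ times the Rademacher average

\[
R_n(A) = \mathbb{E}\left[\sup_{a \in A} \frac{1}{n} \sum_{i=1}^{n} \sigma_i a_i\right]
\]

associated with the set $A$ of vectors of the form

\[
a = (a_1, \dots, a_n) = (1/2 - f_1, \dots, 1/2 - f_n), \qquad f \in \mathcal{F}.
\]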
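
The Rademacher form can likewise be evaluated by enumerating all sign vectors. The sketch below is an illustration for small finite classes only; the function name and the tuple representation of static experts are assumptions made here.

```python
from fractions import Fraction
from itertools import product


def rademacher_regret(experts, n):
    """n * R_n(A) = E sup_f sum_t (1/2 - f_t) * sigma_t, by enumerating all sign vectors."""
    half = Fraction(1, 2)
    total = Fraction(0)
    for sigma in product((1, -1), repeat=n):
        total += max(sum((half - Fraction(f_t)) * s for f_t, s in zip(f, sigma))
                     for f in experts)
    return total / 2 ** n


# Two constant experts on n = 3 rounds, and a three-expert class on n = 2 rounds.
print(rademacher_regret([(0, 0, 0), (1, 1, 1)], 3))
print(rademacher_regret([(0, 1), (1, 0), (Fraction(1, 2), Fraction(1, 2))], 2))
```

For the two constant experts the supremum reduces to half the absolute value of the sum of the signs, which is the quantity studied in the simple example of Section 8.4.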


Rademacher averages are thoroughly studied objects in probability theory, and this will help
us establish tight upper and lower bounds on $V_n(\mathcal{F})$ for various classes of static experts in
Section 8.5. Some basic structural properties of Rademacher averages are summarized in
Section A.1.8.


8.4 A Simple Example


We consider a simple example to illustrate the usage of the formula (8.2). In Section 8.5
we obtain general upper and lower bounds for $V_n(\mathcal{F})$ based on the same characterization
in terms of Rademacher averages.

Consider the case when $\mathcal{F}$ is the class of all “constant” experts, that is, the class of
all static experts, parameterized by $q \in [0, 1]$, of the form $f^{q} = (q, \dots, q)$. Thus each
expert predicts the same number throughout the $n$ rounds. The first thing we notice is that
the class $\mathcal{F}$ is the convex hull of the two “extreme” static experts $f^{(1)} \equiv 0$ and $f^{(2)} \equiv 1$.
Since the Rademacher average of the convex hull of a set equals that of the set itself (see
Section A.1.8 for the basic properties of Rademacher averages), the identity (8.2) implies
that $V_n(\mathcal{F}) = V_n(\mathcal{F}_0)$, where $\mathcal{F}_0$ contains the two extreme experts. (One may also easily
see that for any sequence $y^n$ of outcomes the expert minimizing the cumulative loss $L_f(y^n)$
over the class $\mathcal{F}$ is one of the two extreme experts.) Thus, it suffices to find bounds for
$V_n(\mathcal{F}_0)$. To this end recall that, by Theorem 2.2,

\[
V_n(\mathcal{F}_0) \le V_n^{(2)} \le \sqrt{\frac{n \ln 2}{2}} \approx 0.5887\sqrt{n}.
\]

Next we contrast this bound with bounds obtained directly using (8.2). Since $f_t^{(1)} = 0$ and
$f_t^{(2)} = 1$ for all $t$,

\[
V_n(\mathcal{F}_0) = \mathbb{E}\max\left\{ \sum_{t=1}^{n} \frac{1}{2}\sigma_t,\; \sum_{t=1}^{n} \left(-\frac{1}{2}\right)\sigma_t \right\} = \frac{1}{2}\,\mathbb{E}\left|\sum_{t=1}^{n} \sigma_t\right|.
\]

Using the Cauchy-Schwarz inequality, we may easily bound this quantity from above as
follows:

\[
V_n(\mathcal{F}_0) = \frac{1}{2}\,\mathbb{E}\left|\sum_{t=1}^{n} \sigma_t\right| \le \frac{1}{2}\sqrt{\mathbb{E}\left(\sum_{t=1}^{n} \sigma_t\right)^2} = \frac{\sqrt{n}}{2}.
\]


Observe that this bound has the same order of magnitude as the bound obtained by Theo- 
rem 2.2, but it has a slightly better constant. 

We may obtain a similar lower bound as an easy consequence of Khinchine’s inequality,
which we recall here (see the Appendix for a proof).


Lemma 8.2 (Khinchine’s inequality). Let $a_1, \dots, a_n$ be real numbers, and let $\sigma_1, \dots, \sigma_n$
be i.i.d. Rademacher random variables. Then

\[
\mathbb{E}\left|\sum_{i=1}^{n} a_i \sigma_i\right| \ge \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{n} a_i^2}.
\]

Applying Lemma 8.2 to our problem, we obtain the lower bound

\[
V_n(\mathcal{F}_0) \ge \sqrt{\frac{n}{8}}.
\]

Summarizing the upper and lower bounds, for every $n$ we have

\[
0.3535 \le \frac{V_n(\mathcal{F})}{\sqrt{n}} \le 0.5.
\]

For example, for $n = 100$ there exists a prediction strategy such that for any sequence
$y_1, \dots, y_{100}$ the total loss is not more than that of the best expert plus 5, but for any
prediction strategy there exists a sequence $y_1, \dots, y_{100}$ such that the regret is at least 3.5.
The exact asymptotic value is also easy to calculate: $V_n(\mathcal{F}_0)/\sqrt{n} \to 1/\sqrt{2\pi} \approx 0.3989$ (see
the exercises).
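
These bounds and the limiting constant can be checked with exact arithmetic. The sketch below assumes only the identity that the regret for the two extreme experts is half the expected absolute value of a sum of $n$ Rademacher variables; the function name is hypothetical.

```python
from fractions import Fraction
from math import comb, pi, sqrt


def two_expert_regret(n):
    """Exact V_n(F_0) = (1/2) E|sigma_1 + ... + sigma_n| for the experts 0 and 1."""
    # With k signs equal to -1, the sum is n - 2k; this happens with probability C(n, k) / 2^n.
    expected_abs = Fraction(sum(comb(n, k) * abs(n - 2 * k) for k in range(n + 1)), 2 ** n)
    return expected_abs / 2


for n in (1, 2, 10, 100, 1000):
    ratio = float(two_expert_regret(n)) / sqrt(n)
    print(n, round(ratio, 4))

print("limit:", round(1 / sqrt(2 * pi), 4))
```

The printed ratios stay between the Khinchine lower bound and the Cauchy-Schwarz upper bound, and approach the limiting constant as $n$ grows.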


8.5 Bounds for Classes of Static Experts 


In this section we use Theorem 8.1 and some results from the rich theory of empirical
processes to obtain upper and lower bounds for $V_n(\mathcal{F})$ for general classes $\mathcal{F}$ of static
experts.

Theorem 8.1 characterizes the minimax regret as $n$ times the Rademacher average $R_n(A)$ of
the set $A = \mathbf{1/2} - \mathcal{F}$, where $\mathbf{1/2} = (1/2, \dots, 1/2)$, and the class $\mathcal{F}$ of static experts is
now regarded as a subset of $\mathbb{R}^n$ (by associating, with each static expert $f$, the vector
$(f_1, \dots, f_n) \in [0, 1]^n$). There are various ways of bounding Rademacher averages. One is
by using the structural properties summarized in Section A.1.8, another is in terms of the
geometrical structure of the set $A$. To illustrate the first method, we consider the following
example.


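Example 8.1. Consider the class $\mathcal{F}$ of static experts $f = (f_1,\dots,f_n)$ such that
$f_t = (1 + \sigma(b_t))/2$ for any vector $b = (b_1,\dots,b_n)$ satisfying $\|b\|^2 = \sum_{t=1}^n b_t^2 \le \lambda^2$ for a constant
$\lambda > 0$ and

\[
\sigma(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
\]

is the standard “sigmoid” function. In other words, $\mathcal{F}$ contains all experts obtained by
“squashing” the elements of the ball of radius $\lambda$ into the cube $[0,1]^n$. Intuitively, the
larger the $\lambda$, the more complex $\mathcal{F}$ is, which should be reflected in the value of $V_n(\mathcal{F})$. Next we
derive an upper bound that reflects this behavior. By Theorem 8.1, $V_n(\mathcal{F}) = nR_n(A)$, where
$A$ is the set of vectors of the form $a = (a_1,\dots,a_n)$, with $a_i = \sigma(b_i)/2$ with $\sum_{i=1}^n b_i^2 \le \lambda^2$.
By the contraction principle (see Section A.1.8),

\[
R_n(A) \le \frac{1}{2n}\,\mathbb{E}\left[\sup_{b:\|b\|\le\lambda}\sum_{i=1}^n \sigma_i b_i\right]
= \frac{\lambda}{2n}\,\mathbb{E}\left[\sup_{b:\|b\|\le 1}\sum_{i=1}^n \sigma_i b_i\right],
\]

where $\sigma_1,\dots,\sigma_n$ denote the Rademacher random variables (not to be confused with the function $\sigma$)
and we used the fact that $\sigma$ is Lipschitz with constant 1. Now by the Cauchy–Schwarz
inequality, we have

\[
\mathbb{E}\left[\sup_{b:\|b\|\le 1}\sum_{i=1}^n \sigma_i b_i\right] = \mathbb{E}\sqrt{\sum_{i=1}^n \sigma_i^2} = \sqrt{n}.
\]

We have thus shown that $V_n(\mathcal{F}) \le \lambda\sqrt{n}/2$, and this bound may be shown to be essentially
tight (see Exercise 8.11).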
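
As a rough numerical companion to this example, the sketch below compares the bound $\lambda\sqrt{n}/2$ with the quantity $n\tanh(\lambda/\sqrt{n})/2$. The latter is our own observation rather than a statement of the text: reading the sigmoid as the hyperbolic tangent, the supremum of $\frac12\sum_t s_t\tanh(b_t)$ over the ball $\|b\| \le \lambda$ is the same for every sign pattern $s$ and is attained at $b_t = s_t\lambda/\sqrt{n}$, since $u \mapsto \tanh(\sqrt{u})$ is concave. The function names are hypothetical.

```python
from math import sqrt, tanh

def example_8_1_value(n: int, lam: float) -> float:
    """n * tanh(lam / sqrt(n)) / 2: supremum of (1/2) * sum_t s_t * tanh(b_t)
    over ||b|| <= lam, for any fixed signs s_t in {-1, +1} (our derivation)."""
    # Attained at b_t = s_t * lam / sqrt(n); follows from Jensen's inequality
    # applied to the concave map u -> tanh(sqrt(u)) on [0, inf).
    return 0.5 * n * tanh(lam / sqrt(n))

def example_8_1_bound(n: int, lam: float) -> float:
    """The upper bound lam * sqrt(n) / 2 obtained via the contraction principle."""
    return 0.5 * lam * sqrt(n)

for n in (10, 1000, 100000):
    print(n, example_8_1_value(n, 2.0), example_8_1_bound(n, 2.0))
```

Under these assumptions the two quantities agree up to a factor that tends to 1 as $n$ grows with $\lambda$ fixed, which is consistent with the bound being essentially tight.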


A way to capture the geometric structure of the class of experts $\mathcal{F}$ is to consider its
covering numbers. The covering numbers suitable for our analysis are defined as follows.
For any class $\mathcal{F}$ of static experts let $N_2(\mathcal{F}, r)$ be the minimum cardinality of a set $\mathcal{F}_r$ of
static experts (possibly not all belonging to $\mathcal{F}$) such that for all $f \in \mathcal{F}$ there exists a $g \in \mathcal{F}_r$
such that

\[
\sqrt{\sum_{t=1}^n (f_t - g_t)^2} \le r.
\]


The following bound shows how $V_n(\mathcal{F})$ may be bounded from above in terms of these
covering numbers.


Theorem 8.2. For any class $\mathcal{F}$ of static experts,

\[
V_n(\mathcal{F}) \le 12\int_0^{\sqrt{n}/2} \sqrt{\ln N_2(\mathcal{F}, r)}\, dr.
\]


The result is a straightforward corollary of Theorem 8.1, Hoeffding’s inequality
(Lemma 2.2), and the following classical result of empirical process theory. To state the
result in a general form, consider a family $\{T_f : f \in \mathcal{F}\}$ of zero-mean random variables
indexed by a metric space $(\mathcal{F}, \rho)$. Let $N_\rho(\mathcal{F}, r)$ denote the covering number of the metric
space $\mathcal{F}$ with respect to the metric $\rho$. The family is called subgaussian in the metric $\rho$
whenever

\[
\mathbb{E}\left[e^{\lambda(T_f - T_g)}\right] \le e^{\lambda^2\rho(f,g)^2/2}
\]

holds for any $f, g \in \mathcal{F}$ and $\lambda > 0$. The family is called sample continuous if for any
sequence $f^{(1)}, f^{(2)}, \ldots \in \mathcal{F}$ converging to some $f \in \mathcal{F}$, we have $T_{f^{(n)}} - T_f \to 0$ almost
surely. The proof of the following result is given in the Appendix.


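Theorem 8.3. If $\{T_f : f \in \mathcal{F}\}$ is subgaussian and sample continuous in the metric $\rho$, then

\[
\mathbb{E}\left[\sup_{f\in\mathcal{F}} T_f\right] \le 12\int_0^{D/2} \sqrt{\ln N_\rho(\mathcal{F}, \varepsilon)}\, d\varepsilon,
\]

where $D$ is the diameter of $\mathcal{F}$.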


For completeness, and without proof, we mention a lower bound corresponding to The- 
orem 8.2. Once again, the inequality is a straightforward corollary of Theorem 8.1 and 
known lower bounds for the expected maximum of Rademacher averages. 


Theorem 8.4. Let $\mathcal{F}$ be an arbitrary class of static experts containing $f$ and $g$ such that
$f_t = 0$ and $g_t = 1$ for all $t = 1,\dots,n$. Then there exists a universal constant $K > 0$ such
that

\[
V_n(\mathcal{F}) \ge K \sup_{r>0}\, r\sqrt{\ln N_2(\mathcal{F}, r)}.
\]


The bound of Theorem 8.4 is often of the same order of magnitude as that of Theorem 8.2. 
For examples we refer to the exercises. 

The minimax regret $V_n(\mathcal{F})$ for static experts is expressed as the expected value of the
supremum of a Rademacher process. Such expected values have been studied and well
understood in empirical process theory. In fact, Theorems 8.3 and 8.4 are simple versions
of classical general results (known as “Dudley’s metric entropy bound” and “Sudakov’s
minoration”) of empirical process theory. There exist more modern tools for establishing
sharp bounds for expectations of maxima of random processes, such as “majorizing measures”
and “generic chaining.” In the bibliographic comments we point the interested reader
to some of the references.


8.6 Bounds for General Classes 


All bounds of the previous sections are based on Theorem 8.1, a characterization of $V_n(\mathcal{F})$
in terms of expected suprema of Rademacher processes. Unfortunately, no such tool is
available in the general case when the experts in the class $\mathcal{F}$ are not static. This section
discusses some techniques that may come to the rescue.

We begin with a simple but very useful bound for expert classes that are subsets of 
“convex hulls” of just finitely many experts. 


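Theorem 8.5. Assume that the class of experts $\mathcal{F}$ satisfies the following: there exist $N$
experts $f^{(1)},\dots,f^{(N)}$ (not necessarily in $\mathcal{F}$) such that for all $f \in \mathcal{F}$ there exist convex
coefficients $q_1,\dots,q_N \ge 0$, with $\sum_{j=1}^N q_j = 1$, such that $f_t(y^{t-1}) = \sum_{j=1}^N q_j f^{(j)}_t(y^{t-1})$
for all $t = 1,\dots,n$ and $y^{t-1} \in \{0,1\}^{t-1}$. Then

\[
V_n(\mathcal{F}) \le \sqrt{(n/2)\ln N}.
\]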


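Proof. The key property is that for any bit sequence $y^n \in \{0,1\}^n$ and expert
$f = \sum_{j=1}^N q_j f^{(j)} \in \mathcal{F}$ there exists an expert among $f^{(1)},\dots,f^{(N)}$ whose loss on $y^n$ is not
larger than that of $f$. To see this, note that, because $y_t \in \{0,1\}$ and all predictions lie in $[0,1]$,

\begin{align*}
L_f(y^n) &= \sum_{t=1}^n \left|f_t(y^{t-1}) - y_t\right| \\
&= \sum_{t=1}^n \left|\sum_{j=1}^N q_j f^{(j)}_t(y^{t-1}) - y_t\right| \\
&= \sum_{t=1}^n \sum_{j=1}^N q_j \left|f^{(j)}_t(y^{t-1}) - y_t\right| \\
&= \sum_{j=1}^N q_j \sum_{t=1}^n \left|f^{(j)}_t(y^{t-1}) - y_t\right| \\
&= \sum_{j=1}^N q_j L_{f^{(j)}}(y^n) \\
&\ge \min_{j=1,\dots,N} L_{f^{(j)}}(y^n).
\end{align*}

Thus, for all $y^n \in \{0,1\}^n$, $\inf_{f\in\mathcal{F}} L_f(y^n) \ge \min_{j=1,\dots,N} L_{f^{(j)}}(y^n)$. This implies that if $\widehat{p}$ is
the exponentially weighted average forecaster based on the finite class $f^{(1)},\dots,f^{(N)}$, then,
by Theorem 2.2,

\[
\widehat{L}(y^n) - \inf_{f\in\mathcal{F}} L_f(y^n) \le \widehat{L}(y^n) - \min_{j=1,\dots,N} L_{f^{(j)}}(y^n) \le \sqrt{\frac{n\ln N}{2}},
\]

which completes the proof. ■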
We now review two basic examples. 


Example 8.2 (Linear experts). As a first example, consider the class $\mathcal{L}_k$ of $k$th-order
autoregressive linear experts, where $k \ge 2$ is a fixed positive integer. Because each
prediction of an expert $f \in \mathcal{L}_k$ is determined by the last $k$ bits observed, we add an arbitrary
prefix $y_{-k+1},\dots,y_0$ to the sequence $y^n$ to be predicted. We use $y^n_{-k+1}$ to denote the resulting
sequence of $n + k$ bits. The class $\mathcal{L}_k$ contains all experts $f$ such that

\[
f_t(y^{t-1}_{-k+1}) = \sum_{i=1}^k q_i\, y_{t-i}
\]

for some $q_1,\dots,q_k \ge 0$, with $\sum_{i=1}^k q_i = 1$. In other words, an expert in $\mathcal{L}_k$ predicts
according to a convex combination of the $k$ most recent outcomes in the sequence. Convexity of
the coefficients $q_i$ assures that $f_t(y^{t-1}_{-k+1}) \in [0,1]$. Accordingly, for such experts the value
$V_n(\mathcal{F})$ is redefined by

\[
V_n(\mathcal{F}) = \inf_{P}\, \max_{y^n_{-k+1}} \left(\widehat{L}(y^n_{-k+1}) - \inf_{f\in\mathcal{F}} L_f(y^n_{-k+1})\right),
\]

where

\[
\widehat{L}(y^n_{-k+1}) = \sum_{t=1}^n \left|\widehat{p}_t(y^{t-1}_{-k+1}) - y_t\right|
\]

and $L_f(y^n_{-k+1})$ is defined similarly.


Corollary 8.1. For all positive integers $n$ and $k \ge 2$,

\[
V_n(\mathcal{L}_k) \le \sqrt{\frac{n\ln k}{2}}.
\]


Proof. The statement is a direct consequence of Theorem 8.5 if we observe that $\mathcal{L}_k$ is the
convex hull of the $k$ experts $f^{(1)},\dots,f^{(k)}$ defined by $f^{(i)}_t(y^{t-1}_{-k+1}) = y_{t-i}$, $i = 1,\dots,k$. ■
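
The argument can be illustrated with a small simulation. The sketch below runs an exponentially weighted average forecaster over the $k$ "lag" experts $f^{(i)}_t = y_{t-i}$ under the absolute loss and compares its regret with $\sqrt{(n\ln k)/2}$. The learning rate $\eta = \sqrt{8\ln k/n}$ is the tuning we assume is behind Theorem 2.2; the function name and the random test sequence are hypothetical.

```python
import math
import random

def ewa_regret_lag_experts(bits, prefix, k):
    """Run exponentially weighted averaging over the k lag experts f^(i)_t = y_{t-i}
    (absolute loss) and return (regret against the best lag expert, bound)."""
    assert k >= 2 and len(prefix) == k
    n = len(bits)
    eta = math.sqrt(8 * math.log(k) / n)   # assumed tuning, see lead-in
    history = list(prefix)                  # arbitrary prefix y_{-k+1}, ..., y_0
    expert_loss = [0.0] * k
    forecaster_loss = 0.0
    for y in bits:
        advice = [history[-i] for i in range(1, k + 1)]   # y_{t-1}, ..., y_{t-k}
        best = min(expert_loss)                           # shift for numerical stability
        weights = [math.exp(-eta * (loss - best)) for loss in expert_loss]
        p = sum(w * a for w, a in zip(weights, advice)) / sum(weights)
        forecaster_loss += abs(p - y)
        for i in range(k):
            expert_loss[i] += abs(advice[i] - y)
        history.append(y)
    return forecaster_loss - min(expert_loss), math.sqrt(n * math.log(k) / 2)

random.seed(0)
k, n = 4, 1000
bits = [random.randint(0, 1) for _ in range(n)]
regret, bound = ewa_regret_lag_experts(bits, [0] * k, k)
print(regret, bound)
```

By the convex hull argument, the best lag expert is at least as good as any expert in the class, so the regret against the best lag expert also bounds the regret against the whole class.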


Example 8.3 (Markov forecasters). For an arbitrary $k \ge 1$, we consider the class $\mathcal{M}_k$
of $k$th-order Markov experts defined as follows. The class $\mathcal{M}_k$ is indexed by the set
$[0,1]^{2^k}$ so that the index of any $f \in \mathcal{M}_k$ is the vector $\alpha = (\alpha_0, \alpha_1,\dots,\alpha_{2^k-1})$, with $\alpha_s \in [0,1]$
for $0 \le s < 2^k$. If $f$ has index $\alpha$, then $f_t(y^{t-1}_{-k+1}) = \alpha_s$ for all $1 \le t \le n$ and for all
$y^{t-1}_{-k+1} \in \{0,1\}^{t+k-1}$, where $s$ has binary expansion $y_{t-k},\dots,y_{t-1}$. Because each prediction
of a $k$th-order Markov expert is determined by the last $k$ bits observed, we add a prefix
$y_{-k+1},\dots,y_0$ to the sequence to predict in the same way we did in the previous example
for the autoregressive experts. Thus, the function $f_t$ is now defined over the set $\{0,1\}^{t+k-1}$.
Once more, Theorem 8.5 immediately implies a bound for $V_n(\mathcal{M}_k)$.


Corollary 8.2. For any positive integers $n$ and $k \ge 2$,

\[
V_n(\mathcal{M}_k) \le \sqrt{\frac{2^k n\ln 2}{2}}.
\]
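
To make the indexing of Markov experts concrete, here is a minimal sketch of how such an expert maps the last $k$ bits to a prediction. Reading the context most significant bit first is our assumption about the binary expansion convention, and the index vector `alpha` below is a hypothetical example.

```python
import math

def markov_prediction(alpha, context):
    """Prediction of the k-th order Markov expert with index alpha, given the
    last k bits context = (y_{t-k}, ..., y_{t-1})."""
    k = len(context)
    assert len(alpha) == 2 ** k
    s = 0
    for bit in context:      # read the context as a binary number,
        s = 2 * s + bit      # most significant bit first (assumed convention)
    return alpha[s]

def markov_bound(n, k):
    """sqrt(2^k * n * ln 2 / 2): Theorem 8.5 applied to the 2^(2^k) vertices of [0,1]^(2^k)."""
    return math.sqrt(2 ** k * n * math.log(2) / 2)

alpha = [0.1, 0.9, 0.5, 0.3]               # a hypothetical expert of order k = 2
print(markov_prediction(alpha, (1, 0)))    # context 10 selects state s = 2
print(markov_bound(1000, 2))
```

The class is the convex hull of the $2^{2^k}$ experts whose index vectors have all entries in $\{0,1\}$, which is where the factor $2^k\ln 2 = \ln 2^{2^k}$ in the bound comes from.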


Next we study how one can derive lower bounds for $V_n(\mathcal{F})$ for general classes of experts.
Even though Theorem 8.1 cannot be generalized to arbitrary classes of experts, its analog
remains true as an inequality.


Theorem 8.6. For any class of experts $\mathcal{F}$,

\[
V_n(\mathcal{F}) \ge \mathbb{E}\left[\sup_{f\in\mathcal{F}} \sum_{t=1}^n \left(\frac{1}{2} - f_t(Y^{t-1})\right)(1 - 2Y_t)\right],
\]

where $Y_1,\dots,Y_n$ are independent Bernoulli (1/2) random variables.


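Proof. For any prediction strategy, if $Y_t$ is a Bernoulli (1/2) random variable, then
$\mathbb{E}\left|\widehat{p}_t(y^{t-1}) - Y_t\right| = 1/2$ for each $y^{t-1}$. Hence,

\begin{align*}
V_n(\mathcal{F}) &= \inf_{P}\, \max_{y^n} \left(\widehat{L}(y^n) - \inf_{f\in\mathcal{F}} L_f(y^n)\right) \\
&\ge \inf_{P}\, \mathbb{E}\left[\widehat{L}(Y^n) - \inf_{f\in\mathcal{F}} L_f(Y^n)\right] \\
&= \frac{n}{2} - \mathbb{E}\left[\inf_{f\in\mathcal{F}} L_f(Y^n)\right] \\
&= \mathbb{E}\left[\sup_{f\in\mathcal{F}} \sum_{t=1}^n \left(\frac{1}{2} - f_t(Y^{t-1})\right)(1 - 2Y_t)\right],
\end{align*}

where the last step uses the identity $|f - y| = 1/2 - (1/2 - f)(1 - 2y)$, valid for $f \in [0,1]$ and $y \in \{0,1\}$. ■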


We demonstrate how to use this inequality for the example of the class of linear forecasters
described earlier. In fact, the following result shows that the bound of Corollary 8.1 is
asymptotically optimal.


Corollary 8.3.

\[
\liminf_{k\to\infty}\, \liminf_{n\to\infty}\, \frac{V_n(\mathcal{L}_k)}{\sqrt{n\ln k}} = \frac{1}{\sqrt{2}}.
\]

Proof. We only sketch the proof; the details are left to the reader as an easy exercise. By
Theorem 8.6, if we write $Z_t = 1 - 2Y_t$ for $t = -k+1,\dots,n$, then

\[
V_n(\mathcal{L}_k) \ge \mathbb{E}\left[\sup_{f\in\mathcal{L}_k} \sum_{t=1}^n \left(\frac{1}{2} - f_t(Y^{t-1}_{-k+1})\right) Z_t\right]
= \frac{1}{2}\,\mathbb{E}\left[\max_{i=1,\dots,k} \sum_{t=1}^n Z_t Z_{t-i}\right].
\]

The proof is now similar to the proof of Theorem 3.7 with the exception that instead of the
ordinary central limit theorem, we use a generalization of it to martingales.
Consider the $k$-vector $X_n = (X_{n,1},\dots,X_{n,k})$ of components

\[
X_{n,i} = \frac{1}{\sqrt{n}} \sum_{t=1}^n Z_t Z_{t-i}, \qquad i = 1,\dots,k.
\]

By the Cramér–Wold device (see Lemma 13.11 in the Appendix) the sequence of vectors
$\{X_n\}$ converges in distribution to a vector random variable $N = (N_1,\dots,N_k)$ if and
only if $\sum_{i=1}^k a_i X_{n,i}$ converges in distribution to $\sum_{i=1}^k a_i N_i$ for all possible choices of the
coefficients $a_1,\dots,a_k$. Thus consider

\[
\sum_{i=1}^k a_i X_{n,i} = \frac{1}{\sqrt{n}} \sum_{t=1}^n Z_t \sum_{i=1}^k a_i Z_{t-i}.
\]

It is easy to see that the sequence of random variables $\sqrt{n}\,X_{n,i}$, $n = 1,2,\ldots$, forms a
martingale with respect to the sequence of $\sigma$-algebras $\mathcal{G}_t$ generated by $Z_{-k+1},\dots,Z_t$.
Furthermore, by the martingale central limit theorem (see, e.g., Hall and Heyde [140, Theorem
3.2]), $\sum_{i=1}^k a_i X_{n,i}$ converges in distribution, as $n \to \infty$, to a zero-mean normal random
variable with variance $\sum_{i=1}^k a_i^2$. Then, by the Cramér–Wold device, as $n \to \infty$, the vector
$X_n$ converges in distribution to $N = (N_1,\dots,N_k)$, where $N_1,\dots,N_k$ are independent
standard normal random variables. The rest of the proof is identical to that of Lemma 13.11
in the Appendix, except that Hoeffding’s inequality needs to be replaced by its analog for
bounded martingale differences (see Theorem A.7). ■


8.7 Bibliographic Remarks 


The forecaster presented in Section 8.2 appears in Chung [60], which also gives an optimal 
algorithm for nonsimulatable experts (see the exercises). See Chung [61] for many related 
results. The form of Khinchine’s inequality cited here is due to Szarek [281]. The proof 
given in the Appendix is due to Littlewood [204]. The example described in Section 8.4 was 
first studied in detail by Cover [68], and then generalized substantially by Feder, Merhav, 
and Gutman [95]. 

Theorem 8.3 is a simple version of Dudley’s metric entropy bound [91]. Theorem 8.4 
follows from a result of Sudakov [280]; see Ledoux and Talagrand [192]. Understanding 
the behavior of the expected maximum of random processes, such as the Rademacher 
process characterizing the minimax regret for classes of static experts, has been an active 
topic of probability theory, and the first important results go back to Kolmogorov. Some of 
the key contributions are due to Fernique [97,98], Pisier [235], and Talagrand [287]. We 
recommend that the interested reader consult the recent beautiful book of Talagrand [288] 
for the latest advances. 

The Markov experts appearing in Section 8.6 were first considered by Feder, Merhav, 
and Gutman [95]. See also Cesa-Bianchi and Lugosi [51], where the lower bounds of 
Section 8.6 appear. We refer to [51] for more information on upper and lower bounds for 
general classes of experts. 


8.8 Exercises


8.1 Calculate V®, VO, VS, and Vs. Calculate $V_1(\mathcal{F})$, $V_2(\mathcal{F})$, and $V_3(\mathcal{F})$ when $\mathcal{F}$ contains two
experts $f_{1,t} = 0$ and $f_{2,t} = 1$ for all $t$.


8.2 Prove or disprove

\[
\sup_{\mathcal{F}:|\mathcal{F}|=N} V_n(\mathcal{F}) = V_n^{(N)}.
\]


8.3 Prove Lemma 8.1. Hint: Proceed with a backward induction, starting with $t = n$. Visualizing
the possible sequences of outcomes in a rooted binary tree may help.


8.4 Use Lemma 8.1 to show that if $\mathcal{F}$ is a class of static experts, then the optimal forecaster of
Section 8.3 may be written as

\[
\widehat{p}_t(y^{t-1}) = \frac{1}{2} + \frac{1}{2}\,\mathbb{E}\left[\inf_{f\in\mathcal{F}} L_f(y^{t-1}0Y^{n-t}) - \inf_{f\in\mathcal{F}} L_f(y^{t-1}1Y^{n-t})\right],
\]

where $Y_1,\dots,Y_n$ are i.i.d. Bernoulli (1/2) random variables (Chung [60]).


8.5 Give an appropriate modification of the optimal prediction algorithm of Section 8.2 for the case
of general “non-simulatable” experts; that is, give a prediction algorithm that achieves the
worst-case regret $V_n^{(N)}$. (See Cesa-Bianchi, Freund, Haussler, Helmbold, Schapire, and Warmuth [48]
and Chung [60].) Warning: This exercise requires work.


8.6 Show that Lemma 8.1 and Theorem 8.1 are not necessarily true if the experts in $\mathcal{F}$ are not static.


8.7 Show that for the class of experts discussed in Section 8.4

\[
\lim_{n\to\infty} \frac{V_n(\mathcal{F})}{\sqrt{n}} = \frac{1}{\sqrt{2\pi}}.
\]

8.8 Consider the simple class of experts described in Section 8.4. Feder, Merhav, and Gutman [95]
proposed the following forecaster. Let $p_t$ denote the fraction of times outcome 1 appeared in
the sequence $y_1,\dots,y_{t-1}$. Then the forecaster is defined by $\widehat{p}_t = \psi_t(p_t)$, where for $x \in [0,1]$,

\[
\psi_t(x) = \begin{cases}
0 & \text{if } x < 1/2 - \varepsilon_t, \\
1 & \text{if } x > 1/2 + \varepsilon_t, \\
1/2 + (x - 1/2)/(2\varepsilon_t) & \text{otherwise}
\end{cases}
\]

and $\varepsilon_t > 0$ is some positive number. Show that if $\varepsilon_t = 1/(2\sqrt{t+2})$, then for all $n \ge 1$ and
$y^n \in \{0,1\}^n$,

\[
\widehat{L}(y^n) - \min_{i=1,2} L_i(y^n) \le \sqrt{n+1} + \frac{1}{2}.
\]

(See [95].) Hint: Show first that among all sequences containing $n_1 \le n/2$ 1’s the forecaster
performs worst for the sequence which starts alternating 0’s and 1’s $n_1$ times and ends with
$n - 2n_1$ 0’s.


8.9 After reading the previous exercise you may wonder whether the simpler forecaster defined by

\[
\psi(x) = \begin{cases}
0 & \text{if } x < 1/2, \\
1 & \text{if } x > 1/2, \\
1/2 & \text{if } x = 1/2
\end{cases}
\]

also does the job. Show that this is not true. More precisely, show that there exists a sequence
$y^n$ such that $\widehat{L}(y^n) - \min_{i=1,2} L_i(y^n) \approx n/4$ for large $n$. (See [95].)
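
A quick experiment makes the contrast between the two forecasters tangible. The sketch below runs both maps on the alternating sequence 1, 0, 1, 0, ... and reports the regret against the better of the two constant experts. Taking the fraction of 1's to be 1/2 when the past is empty is our convention (the exercises leave it unspecified), and the function names are hypothetical.

```python
import math

def psi_smooth(x, t):
    """Piecewise-linear map of the preceding exercise, with eps_t = 1 / (2 sqrt(t + 2))."""
    eps = 1.0 / (2.0 * math.sqrt(t + 2))
    if x < 0.5 - eps:
        return 0.0
    if x > 0.5 + eps:
        return 1.0
    return 0.5 + (x - 0.5) / (2.0 * eps)

def psi_hard(x, t):
    """The simpler threshold map (the round index t is unused)."""
    if x < 0.5:
        return 0.0
    if x > 0.5:
        return 1.0
    return 0.5

def regret(psi, ys):
    """Absolute-loss regret against the better of the constant experts 0 and 1."""
    ones, loss = 0, 0.0
    for t, y in enumerate(ys, start=1):
        frac = ones / (t - 1) if t > 1 else 0.5   # empty past: 1/2 by assumption
        loss += abs(psi(frac, t) - y)
        ones += y
    return loss - min(ones, len(ys) - ones)

n = 1000
ys = [1, 0] * (n // 2)     # alternating outcomes 1, 0, 1, 0, ...
print(regret(psi_hard, ys), n / 4)
print(regret(psi_smooth, ys), math.sqrt(n + 1) + 0.5)
```

On this sequence the threshold forecaster is fully wrong on every even round and pays 1/2 on every odd round, so its regret grows linearly, while the smoothed forecaster stays below the $\sqrt{n+1} + 1/2$ level. This is only an illustration on one sequence, not a proof of either exercise.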

8.10 Let $\mathcal{F}$ be the class of all static experts of the form $f_t = p$ regardless of $t$. The class contains all
such experts with $p \in [0,1]$. Estimate the covering numbers $N_2(\mathcal{F}, r)$ and compare the upper
and lower bounds of Theorems 8.2 and 8.4 for this case.

8.11 Use Theorem 8.4 to show that the upper bound derived for $V_n(\mathcal{F})$ in Example 8.1 is tight up to
a constant.

8.12 Consider the class $\mathcal{F}$ of all static experts that predict in a monotonic way; that is, for each
$f \in \mathcal{F}$, either $f_t \le f_{t+1}$ for all $t \ge 1$ or $f_t \ge f_{t+1}$ for all $t \ge 1$. Apply Theorem 8.2 to conclude
that

\[
V_n(\mathcal{F}) = O\left(\sqrt{n\log n}\right).
\]

What do you obtain using Theorem 8.4? Can you apply Theorem 8.5 in this case?

8.13 Construct a forecaster such that for all $k = 1,2,\ldots$ and all sequences $y_1, y_2, \ldots$,

\[
\limsup_{n\to\infty} \frac{1}{n}\left(\widehat{L}(y^n) - \inf_{f\in\mathcal{M}_k} L_f(y^n)\right) \le 0.
\]

In other words, the forecaster predicts asymptotically as well as any Markov forecaster of any
order. (The existence of such a forecaster was shown by Feder, Merhav, and Gutman [95].)
Hint: For an easy construction use the doubling trick.

8.14 Show that

\[
\liminf_{k\to\infty}\, \liminf_{n\to\infty}\, \frac{V_n(\mathcal{M}_k)}{\sqrt{2^k n\ln 2}} = \frac{1}{\sqrt{2}}.
\]

Hint: Mimic the proof of Corollary 8.3.


9 


Logarithmic Loss 


9.1 Sequential Probability Assignment 


This chapter is entirely devoted to the investigation of a special loss function, the loga- 
rithmic loss, sometimes also called self-information loss. The reason for this distinguished 
attention is that this loss function has a meaningful interpretation in various sequential 
decision problems, including repeated gambling and data compression. These problems are 
briefly described later. Sequential prediction aimed at minimizing logarithmic loss is also 
intimately related to maximizing benefits by repeated investment in the stock market. This 
application is studied in Chapter 10. 

Now we describe the setup for the whole chapter. Let $m > 1$ be a fixed positive integer
and let the outcome space be $\mathcal{Y} = \{1,2,\dots,m\}$. The decision space is the probability
simplex

\[
\mathcal{D} = \left\{p = (p(1),\dots,p(m)) : \sum_{j=1}^m p(j) = 1,\; p(j) \ge 0,\; j = 1,\dots,m\right\} \subset \mathbb{R}^m.
\]

A vector $p \in \mathcal{D}$ is often interpreted as a probability distribution over the set $\mathcal{Y}$. Indeed,
in some cases, the forecaster is required to assign a probability to each possible outcome,
representing the forecaster’s belief. For example, weather forecasts often take the form “the
probability of rain is 40%.”

In the entire chapter we consider the model of simulatable experts introduced in
Section 2.9. Thus, an expert $f$ is a sequence $(f_1, f_2, \ldots)$ of functions $f_t : \mathcal{Y}^{t-1} \to \mathcal{D}$, so that after
having seen the past outcomes $y^{t-1}$, expert $f$ outputs the probability vector $f_t(\cdot \mid y^{t-1}) \in \mathcal{D}$.
(If $t = 1$, $f_1 = (f_1(1),\dots,f_1(m))$ is simply an element of $\mathcal{D}$.) For the components of this
vector we write

\[
f_t(1 \mid y^{t-1}),\dots,f_t(m \mid y^{t-1}).
\]

This notation emphasizes the analogy between an expert and a probability distribution.
Indeed, the $j$th component $f_t(j \mid y^{t-1})$ of this vector may be interpreted as the conditional
probability $f$ assigns to the $j$th element of $\mathcal{Y}$ given the past $y^{t-1}$.

Similarly, at each time instant the forecaster chooses a probability vector

\[
\widehat{p}_t(\cdot \mid y^{t-1}) = \left(\widehat{p}_t(1 \mid y^{t-1}),\dots,\widehat{p}_t(m \mid y^{t-1})\right).
\]
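Once again, using the analogy with probability distributions, we may introduce, for all
$n \ge 1$ and $y^n \in \mathcal{Y}^n$, the notation

\[
f_n(y^n) = \prod_{t=1}^n f_t(y_t \mid y^{t-1}), \qquad \widehat{p}_n(y^n) = \prod_{t=1}^n \widehat{p}_t(y_t \mid y^{t-1}).
\]

Observe that

\[
\sum_{y^n\in\mathcal{Y}^n} f_n(y^n) = \sum_{y^n\in\mathcal{Y}^n} \widehat{p}_n(y^n) = 1
\]

and therefore expert $f$, as well as the forecaster, define probability distributions over the
set of all sequences of length $n$. Conversely, any probability distribution $\widehat{p}_n(y^n)$ over the
set $\mathcal{Y}^n$ defines a forecaster by the induced conditional distributions

\[
\widehat{p}_t(y_t \mid y^{t-1}) = \frac{\widehat{p}_t(y^t)}{\widehat{p}_{t-1}(y^{t-1})},
\]

where $\widehat{p}_t(y^t) = \sum_{y^n_{t+1}\in\mathcal{Y}^{n-t}} \widehat{p}_n(y^n)$.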
The loss function we consider throughout this chapter is defined by

\[
\ell(p, y) = \sum_{j=1}^m \mathbb{I}_{\{y=j\}} \ln\frac{1}{p(j)} = \ln\frac{1}{p(y)}, \qquad y \in \mathcal{Y},\; p \in \mathcal{D}.
\]


It is clear from the definition of the loss function that the goal of the forecaster is to assign
a large probability to the outcomes in the sequence. For an outcome sequence $y_1,\dots,y_n$
the cumulative loss of expert $f$ and the forecaster are, respectively,

\[
L_f(y^n) = \sum_{t=1}^n \ln\frac{1}{f_t(y_t \mid y^{t-1})} \qquad\text{and}\qquad
\widehat{L}(y^n) = \sum_{t=1}^n \ln\frac{1}{\widehat{p}_t(y_t \mid y^{t-1})}.
\]

The cumulative loss of an expert $f$ may also be written as $L_f(y^n) = -\ln f_n(y^n)$. (Similarly,
$\widehat{L}(y^n) = -\ln \widehat{p}_n(y^n)$.) In other words, the cumulative loss is just the negative log likelihood
assigned to the outcome sequence by the expert $f$. Given a class $\mathcal{F}$ of experts, the difference
between the cumulative loss of the forecaster and that of the best expert, that is, the regret,
may now be written as

\begin{align*}
\widehat{L}(y^n) - \inf_{f\in\mathcal{F}} L_f(y^n)
&= \sum_{t=1}^n \ln\frac{1}{\widehat{p}_t(y_t \mid y^{t-1})} - \inf_{f\in\mathcal{F}} \sum_{t=1}^n \ln\frac{1}{f_t(y_t \mid y^{t-1})} \\
&= \sup_{f\in\mathcal{F}} \ln\frac{f_n(y^n)}{\widehat{p}_n(y^n)}.
\end{align*}

The regret may be interpreted as the logarithm of the ratio of the total probabilities that are
sequentially assigned to the outcome sequence by the forecaster and the experts.


Remark 9.1 (Infinite alphabets). We restrict the discussion to the case when the out- 
come space is a finite set. Note, however, that most results can be extended easily to the 
more general case when Y is a measurable space. In such cases the decision space becomes 
the set of all densities over Y with respect to some fixed common dominating measure, and 
the loss is the negative logarithm of the density evaluated at the outcome. 

We begin the study of prediction under the logarithmic loss by considering mixture
forecasters, the most natural predictors for this case. In Section 9.3 we briefly describe 
two applications, gambling and sequential data compression, in which the logarithmic loss 
function appears in a natural way. These applications have served as a main motivation for 
the large body of work done on prediction using this loss function. A special property of 
the logarithmic loss is that the minimax optimal predictor can be explicitly determined for 
any class F of experts. This is done in Section 9.4. In Sections 9.6 and 9.7 we discuss, in 
detail, the special case when F contains all constant predictors. Two versions of mixture 
forecasters are described and it is shown that their performance approximates that of 
the minimax optimal predictor. Section 9.8 describes another phenomenon specific to the 
logarithmic loss. We obtain lower bounds conceptually stronger than minimax lower bounds 
for some special yet important classes of experts. In Section 9.9 we extend the setup by 
allowing side information taking values in a finite set. In Section 9.10 a general upper 
bound for the minimax regret is derived in terms of the geometrical structure of the class 
of experts. The examples of Section 9.11 show how this general result can be applied to 
various special cases. 


9.2 Mixture Forecasters 


Recall from Section 3.3 that the logarithmic loss function is exp-concave for $\eta \le 1$ and
Theorem 3.2 applies. Thus, if the class $\mathcal{F}$ of experts is finite and $|\mathcal{F}| = N$, then the
exponentially weighted average forecaster with parameter $\eta = 1$ satisfies

\[
\widehat{L}(y^n) - \inf_{f\in\mathcal{F}} L_f(y^n) \le \ln N
\]

or, equivalently,

\[
\widehat{p}_n(y^n) \ge \frac{1}{N} \sup_{f\in\mathcal{F}} f_n(y^n).
\]


It is worth noting that the exponentially weighted average forecaster has an interesting
interpretation in this special case. Observe that the definition

\[
\widehat{p}_t(y_t \mid y^{t-1}) = \frac{\sum_{f\in\mathcal{F}} f_t(y_t \mid y^{t-1})\, e^{-\eta L_f(y^{t-1})}}{\sum_{f\in\mathcal{F}} e^{-\eta L_f(y^{t-1})}}
\]

of the exponentially weighted average forecaster, with parameter $\eta = 1$, reduces to

\[
\widehat{p}_t(y_t \mid y^{t-1}) = \frac{\sum_{f\in\mathcal{F}} f_t(y_t \mid y^{t-1})\, f_{t-1}(y^{t-1})}{\sum_{f\in\mathcal{F}} f_{t-1}(y^{t-1})}
= \frac{\sum_{f\in\mathcal{F}} f_t(y^t)}{\sum_{f\in\mathcal{F}} f_{t-1}(y^{t-1})}.
\]

Thus, the total probability the forecaster assigns to a sequence $y^n$ is just

\[
\widehat{p}_n(y^n) = \prod_{t=1}^n \widehat{p}_t(y_t \mid y^{t-1})
= \prod_{t=1}^n \frac{\sum_{f\in\mathcal{F}} f_t(y^t)}{\sum_{f\in\mathcal{F}} f_{t-1}(y^{t-1})}
= \frac{\sum_{f\in\mathcal{F}} f_n(y^n)}{N}
\]

(recall that $f_0(y^0)$ is defined to be equal to 1). In other words, the probability distribution
the exponentially weighted average forecaster $\widehat{p}$ defines over the set $\mathcal{Y}^n$ of all strings of
length $n$ is just the uniform mixture of the distributions defined by the experts. This is
why we sometimes call the exponentially weighted average forecaster mixture forecaster.
Interestingly, for the logarithmic loss, with $\eta = 1$, the mixture forecaster coincides with
the greedy forecaster, and it is also an aggregating forecaster, as we show in Sections 3.4
and 3.5.
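
The telescoping identity behind the mixture interpretation is easy to verify numerically. The following sketch uses a small, hypothetical class of i.i.d. (constant) experts over the alphabet {1, 2}, builds the sequential predictions of the exponentially weighted forecaster with parameter 1, and checks on every sequence that their product equals the uniform mixture and that the regret is at most $\ln N$. All names are illustrative.

```python
import math
from itertools import product

def constant_expert(q):
    """Expert assigning probability q to symbol 1 and 1 - q to symbol 2, ignoring the past."""
    return lambda y, past: q if y == 1 else 1.0 - q

def joint(f, ys):
    """Joint probability f_n(y^n) as the product of conditionals; equals 1 for the empty string."""
    p = 1.0
    for t, y in enumerate(ys):
        p *= f(y, ys[:t])
    return p

def mixture_conditional(experts, y, past):
    """Exponentially weighted prediction with eta = 1: weights are the past likelihoods."""
    weights = [joint(f, past) for f in experts]
    return sum(w * f(y, past) for w, f in zip(weights, experts)) / sum(weights)

def worst_case_regret(experts, n):
    worst = 0.0
    for ys in product((1, 2), repeat=n):
        p_hat = 1.0
        for t in range(n):
            p_hat *= mixture_conditional(experts, ys[t], ys[:t])
        uniform_mix = sum(joint(f, ys) for f in experts) / len(experts)
        assert abs(p_hat - uniform_mix) < 1e-12      # product of conditionals = mixture
        worst = max(worst, math.log(max(joint(f, ys) for f in experts) / p_hat))
    return worst

experts = [constant_expert(q) for q in (0.2, 0.5, 0.9)]
print(worst_case_regret(experts, 5), math.log(len(experts)))
```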

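Recalling that the aggregating forecaster may be generalized for a countable number
of experts, we may consider the following extension. Let $f^{(1)}, f^{(2)}, \ldots$ be the experts of
a countable family $\mathcal{F}$. To define the mixture forecaster, we assign a nonnegative number
$\pi_i \ge 0$ to each expert $f^{(i)} \in \mathcal{F}$ such that $\sum_{i=1}^\infty \pi_i = 1$. Then the aggregating forecaster
becomes

\[
\widehat{p}_t(y_t \mid y^{t-1})
= \frac{\sum_{i=1}^\infty \pi_i\, f^{(i)}_t(y_t \mid y^{t-1})\, f^{(i)}_{t-1}(y^{t-1})}{\sum_{j=1}^\infty \pi_j\, f^{(j)}_{t-1}(y^{t-1})}
= \frac{\sum_{i=1}^\infty \pi_i\, f^{(i)}_t(y_t \mid y^{t-1})\, e^{-L_{f^{(i)}}(y^{t-1})}}{\sum_{j=1}^\infty \pi_j\, e^{-L_{f^{(j)}}(y^{t-1})}}.
\]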

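Now it is obvious that the joint probability the forecaster $\widehat{p}$ assigns to each sequence $y^n$ is

\[
\widehat{p}_n(y^n) = \sum_{i=1}^\infty \pi_i\, f^{(i)}_n(y^n).
\]

Note that $\widehat{p}_n$ indeed defines a valid probability distribution over $\mathcal{Y}^n$. Using the trivial bound
$\widehat{p}_n(y^n) = \sum_{i=1}^\infty \pi_i f^{(i)}_n(y^n) \ge \pi_k f^{(k)}_n(y^n)$ for all $k$, we obtain, for all $y^n \in \mathcal{Y}^n$,

\[
\widehat{L}(y^n) \le \inf_{i=1,2,\ldots} \left(L_{f^{(i)}}(y^n) + \ln\frac{1}{\pi_i}\right).
\]

This inequality is a special case of the “oracle inequality” derived in Section 3.5.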

Because of their analogy with mixture estimators emerging in bayesian statistics, the
mixture forecaster is sometimes called bayesian mixture or bayesian model averaging, and
the “initial” weights $\pi_i$ prior probabilities. Because our setup is not bayesian, we avoid
this terminology.

Later in this chapter we extend the idea of a mixture forecaster to certain uncountably 
infinite classes of experts (see also Section 3.3). 


9.3 Gambling and Data Compression 


Imagine that we gamble in a horse race in which $m$ horses run repeatedly many times. In the
$t$th race we bet our entire fortune on the $m$ horses according to proportions $\widehat{p}_t(1),\dots,\widehat{p}_t(m)$,
where the $\widehat{p}_t(j)$ are nonnegative numbers with $\sum_{j=1}^m \widehat{p}_t(j) = 1$. If horse $j$ wins the $t$th race,
we multiply our money bet on this horse by a factor of $o_t(j)$ and we lose it otherwise. The
odds $o_t(j)$ are arbitrary positive numbers. In other words, if $y_t$ denotes the index of the
winning horse of the $t$th race, after the $t$th race we multiply our capital by a factor of

\[
\sum_{j=1}^m \mathbb{I}_{\{y_t=j\}}\, \widehat{p}_t(j)\, o_t(j) = \widehat{p}_t(y_t)\, o_t(y_t).
\]

To make it explicit that the proportions $\widehat{p}_t(j)$ according to which we bet may depend on
the results of previous races, we write $\widehat{p}_t(j \mid y^{t-1})$. If we start with an initial capital of $C$
units, our capital after $n$ races is

\[
C \prod_{t=1}^n \widehat{p}_t(y_t \mid y^{t-1})\, o_t(y_t).
\]

Now assume that before each race we ask the advice of a class of experts and our goal is
to win almost as much as the best of these experts. If expert $f$ divides his capital in the $t$th
race according to proportions $f_t(j \mid y^{t-1})$, $j = 1,\dots,m$, and starts with the same initial
capital $C$, then his capital after $n$ races becomes

\[
C \prod_{t=1}^n f_t(y_t \mid y^{t-1})\, o_t(y_t).
\]

The ratio between the best expert’s money and ours is thus

\[
\frac{\sup_{f\in\mathcal{F}} C \prod_{t=1}^n f_t(y_t \mid y^{t-1})\, o_t(y_t)}{C \prod_{t=1}^n \widehat{p}_t(y_t \mid y^{t-1})\, o_t(y_t)}
= \sup_{f\in\mathcal{F}} \frac{f_n(y^n)}{\widehat{p}_n(y^n)}
\]

independently of the odds. The logarithm of this quantity is just the difference of the
cumulative logarithmic loss of the forecaster $\widehat{p}$ and that of the best expert described in
the previous section. In Chapter 10 we discuss in great detail a model of gambling (i.e.,
sequential investment) that is more general than the one described here. We will see that
sequential probability assignment under the logarithmic loss function is the key in the more
general model as well.
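
The claim that the wealth ratio does not depend on the odds can be checked with a few lines of code. In the sketch below the betting proportions, the odds, and the winners are all made-up illustrative data, and horses are indexed from 0 rather than from 1 for convenience.

```python
import math
import random

def capital_after(bets, winners, odds, initial=1.0):
    """Capital after betting proportions bets[t][j] on horse j in race t."""
    capital = initial
    for p_t, y_t, o_t in zip(bets, winners, odds):
        capital *= p_t[y_t] * o_t[y_t]    # only the money placed on the winner pays out
    return capital

random.seed(1)
m, n = 3, 20
winners = [random.randrange(m) for _ in range(n)]                      # indices 0..m-1
odds = [[random.uniform(0.5, 5.0) for _ in range(m)] for _ in range(n)]
ours = [[1 / 3, 1 / 3, 1 / 3]] * n      # a hypothetical forecaster
expert = [[0.5, 0.3, 0.2]] * n          # a hypothetical expert

wealth_ratio = capital_after(expert, winners, odds) / capital_after(ours, winners, odds)
likelihood_ratio = math.prod(e[y] / p[y] for e, p, y in zip(expert, ours, winners))
print(wealth_ratio, likelihood_ratio)
```

The two printed numbers agree up to rounding: the odds cancel in the ratio, leaving only the ratio of the probabilities assigned to the winning horses.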

Another important motivation for the study of the logarithmic loss function has its roots
in information theory, more concretely in lossless source coding. Instead of describing the
sequential data compression problem in detail, we briefly mention that it is well known
that any probability distribution $f_n$ over the set $\mathcal{Y}^n$ defines a code (the so-called
Shannon–Fano code), which assigns, to each string $y^n \in \mathcal{Y}^n$, a codeword, that is, a string of bits
of length $\ell_n(y^n) = \lceil -\log_2 f_n(y^n) \rceil$. Conversely, any code with codeword lengths $\ell_n(y^n)$,
satisfying a natural condition (i.e., unique decodability), defines a probability distribution
by $f_n(y^n) = 2^{-\ell_n(y^n)} / \sum_{x^n\in\mathcal{Y}^n} 2^{-\ell_n(x^n)}$. Given a class of codes, or equivalently, a class $\mathcal{F}$
of experts, the best compression of a string $y^n$ is achieved by the code that minimizes the
length $\lceil -\log_2 f_n(y^n) \rceil$, which is approximately equivalent to minimizing the logarithmic
cumulative loss $L_f(y^n)$. Now assume that the symbols of the string $y^n$ are revealed one by
one and the goal is to compress it almost as well as the best code in the class. It turns out
that for any forecaster $\widehat{p}_n$ (i.e., sequential probability assignment) it is possible to construct,
sequentially, a codeword of length about $-\log_2 \widehat{p}_n(y^n)$ using a method called arithmetic
coding. The regret

\[
-\log_2 \widehat{p}_n(y^n) - \inf_{f\in\mathcal{F}} \left(-\log_2 f_n(y^n)\right)
= \frac{1}{\ln 2}\left(\widehat{L}(y^n) - \inf_{f\in\mathcal{F}} L_f(y^n)\right)
\]

is often called the (pointwise) redundancy of the code with respect to the class of codes
$\mathcal{F}$. In this respect, the problem of sequential lossless data compression is equivalent to
sequential prediction under the logarithmic loss.
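
The correspondence between distributions and codeword lengths can be illustrated as follows. The sketch computes the lengths $\lceil -\log_2 f_n(y^n) \rceil$ for a hypothetical i.i.d. source over {1, 2} and evaluates the Kraft sum $\sum 2^{-\text{length}}$; the source parameter and function name are assumptions made for the example.

```python
import math
from itertools import product

def shannon_code_lengths(dist):
    """Codeword lengths ceil(-log2 f(y)) for a distribution given as {string: probability}."""
    return {y: math.ceil(-math.log2(p)) for y, p in dist.items() if p > 0}

q, n = 0.3, 4    # hypothetical i.i.d. source: P(symbol 1) = q, strings of length n
dist = {ys: math.prod(q if y == 1 else 1 - q for y in ys)
        for ys in product((1, 2), repeat=n)}
lengths = shannon_code_lengths(dist)
kraft_sum = sum(2.0 ** -length for length in lengths.values())
print(kraft_sum)   # a Kraft sum of at most 1 is compatible with unique decodability
```

Each length exceeds the ideal value $-\log_2 f_n(y^n)$ by less than one bit, which is why minimizing the codeword length is approximately the same as minimizing the cumulative logarithmic loss.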


9.4 The Minimax Optimal Forecaster


In this section we investigate the minimax regret, defined in Section 2.10, for the logarithmic
loss. An important distinguishing feature of the logarithmic loss function is that the minimax
optimal forecaster can be determined explicitly. This fact facilitates the investigation of the
minimax regret, which, in turn, serves as a standard to which the performance of any
forecaster should be compared. Recall that for a given class $\mathcal{F}$ of experts, and integer
$n > 0$, the minimax regret is defined by

\[
V_n(\mathcal{F}) = \inf_{\widehat{p}}\, \sup_{y^n\in\mathcal{Y}^n} \left(\widehat{L}(y^n) - \inf_{f\in\mathcal{F}} L_f(y^n)\right)
= \inf_{\widehat{p}}\, \sup_{y^n\in\mathcal{Y}^n} \ln\frac{\sup_{f\in\mathcal{F}} f_n(y^n)}{\widehat{p}_n(y^n)}.
\]

If for a given forecaster $\widehat{p}$ we define the worst-case regret by

\[
V_n(\widehat{p}, \mathcal{F}) = \sup_{y^n\in\mathcal{Y}^n} \left(\widehat{L}(y^n) - \inf_{f\in\mathcal{F}} L_f(y^n)\right),
\]

then $V_n(\mathcal{F}) = \inf_{\widehat{p}} V_n(\widehat{p}, \mathcal{F})$. Interestingly, in the case of the logarithmic loss it is possible
to identify explicitly the unique forecaster achieving the minimax regret. Theorem 9.1
shows that the forecaster $p^*$ defined by the normalized maximum likelihood probability
distribution

\[
p^*_n(y^n) = \frac{\sup_{f\in\mathcal{F}} f_n(y^n)}{\sum_{x^n\in\mathcal{Y}^n} \sup_{f\in\mathcal{F}} f_n(x^n)}
\]

has this property. Note that $p^*_n$ is indeed a probability distribution over the set $\mathcal{Y}^n$, and recall
that this probability distribution defines a forecaster by the corresponding conditional
probabilities $p^*_t(y_t \mid y^{t-1})$.


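Theorem 9.1. For any class $\mathcal{F}$ of experts and integer $n > 0$, the normalized maximum
likelihood forecaster $p^*$ is the unique forecaster such that

\[
\sup_{y^n\in\mathcal{Y}^n} \left(\widehat{L}(y^n) - \inf_{f\in\mathcal{F}} L_f(y^n)\right) = V_n(\mathcal{F}).
\]

Moreover, $p^*$ is an equalizer; that is, for all $y^n \in \mathcal{Y}^n$,

\[
\ln\frac{\sup_{f\in\mathcal{F}} f_n(y^n)}{p^*_n(y^n)} = \ln \sum_{x^n\in\mathcal{Y}^n} \sup_{f\in\mathcal{F}} f_n(x^n) = V_n(\mathcal{F}).
\]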


Proof. First we show the second part of the statement. Note that by the definition of $p^*$,
its cumulative loss satisfies

\[
\widehat{L}(y^n) - \inf_{f\in\mathcal{F}} L_f(y^n) = \ln\frac{\sup_{f\in\mathcal{F}} f_n(y^n)}{p^*_n(y^n)}
= \ln \sum_{x^n\in\mathcal{Y}^n} \sup_{f\in\mathcal{F}} f_n(x^n),
\]

which is independent of $y^n$, so $p^*$ is indeed an equalizer.
To show that $p^*$ is minimax optimal, let $p \ne p^*$ be an arbitrary forecaster. Then since
$\sum_{y^n\in\mathcal{Y}^n} p_n(y^n) = \sum_{y^n\in\mathcal{Y}^n} p^*_n(y^n) = 1$, for some $y^n \in \mathcal{Y}^n$ we must have $p_n(y^n) < p^*_n(y^n)$.
But then, for this $y^n$,

\[
\ln\frac{\sup_{f\in\mathcal{F}} f_n(y^n)}{p_n(y^n)} > \ln\frac{\sup_{f\in\mathcal{F}} f_n(y^n)}{p^*_n(y^n)} = \text{const.} = V_n(p^*, \mathcal{F})
\]

by the equalizer property. Hence,

\[
V_n(p, \mathcal{F}) = \sup_{y^n\in\mathcal{Y}^n} \ln\frac{\sup_{f\in\mathcal{F}} f_n(y^n)}{p_n(y^n)} > V_n(p^*, \mathcal{F}),
\]

which proves the theorem. ■


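In Section 2.10 we show that, under general conditions satisfied here, the maximin regret

\[
U_n(\mathcal{F}) = \sup_{q}\, \inf_{p} \sum_{y^n\in\mathcal{Y}^n} q(y^n) \ln\frac{\sup_{f\in\mathcal{F}} f_n(y^n)}{p_n(y^n)}
\]

equals the minimax regret $V_n(\mathcal{F})$. It is evident that the normalized maximum likelihood
forecaster $p^*$ achieves the maximin regret as well, in the sense that

\[
U_n(\mathcal{F}) = \sup_{q} \sum_{y^n\in\mathcal{Y}^n} q(y^n) \ln\frac{\sup_{f\in\mathcal{F}} f_n(y^n)}{p^*_n(y^n)},
\]

where the supremum is taken over all probability distributions $q$ over $\mathcal{Y}^n$.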


Even though we have been able to determine the minimax optimal forecaster explicitly,
note that the practical implementation of the forecaster may be problematic. First of all,
we determined the forecaster via the joint probabilities it assigns to all strings of length $n$,
and calculation of the actual predictions $p^*_t(y_t \mid y^{t-1})$ involves sums of exponentially many
terms.

It is important to point out that previous knowledge of the total length n of the sequence 
to be predicted is necessary to determine the minimax optimal forecaster p*. Indeed, it is 
easy to see that if the minimax optimal forecaster is determined for a certain horizon n, 
then the forecaster is not the extension of the minimax optimal forecaster for horizon n — 1, 
even for nicely structured classes of experts. (See Exercise 9.4.) 

Theorem 9.1 not only describes the minimax optimal forecaster but also gives a useful
formula for the minimax regret $V_n(\mathcal{F})$, which we study in detail in the subsequent
sections.
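
For very small problems the normalized maximum likelihood distribution can be computed by brute force, which makes the equalizer property visible. The sketch below does this for a hypothetical class of three i.i.d. experts over the alphabet {1, 2} and horizon $n = 4$; the enumeration over all $2^n$ strings is exactly the exponential cost mentioned above, and the names are illustrative.

```python
import math
from itertools import product

def nml(experts, n, alphabet=(1, 2)):
    """Normalized maximum likelihood distribution over alphabet^n and the value
    ln(sum over x^n of sup_f f_n(x^n)). Each expert maps a tuple y^n to f_n(y^n)."""
    best = {ys: max(f(ys) for f in experts) for ys in product(alphabet, repeat=n)}
    total = sum(best.values())
    return {ys: b / total for ys, b in best.items()}, math.log(total)

def iid_expert(q):
    """Joint probability of a string under i.i.d. symbols with P(symbol 1) = q."""
    return lambda ys: math.prod(q if y == 1 else 1.0 - q for y in ys)

experts = [iid_expert(q) for q in (0.1, 0.5, 0.9)]
p_star, v_n = nml(experts, n=4)
regrets = [math.log(max(f(ys) for f in experts) / p) for ys, p in p_star.items()]
print(v_n, min(regrets), max(regrets), math.log(len(experts)))
```

Every string yields the same regret, equal to the logarithm of the normalizing sum, and for a finite class this value does not exceed $\ln N$.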


9.5 Examples 


Next we work out a few simple examples to understand better the behavior of the minimax
regret $V_n(\mathcal{F})$.


Finite Classes

To start with the simplest possible case, consider a finite class of experts with $|\mathcal{F}| = N$.
Then clearly,

\begin{align*}
V_n(\mathcal{F}) &= \ln \sum_{y^n\in\mathcal{Y}^n} \sup_{f\in\mathcal{F}} f_n(y^n) \\
&\le \ln \sum_{y^n\in\mathcal{Y}^n} \sum_{f\in\mathcal{F}} f_n(y^n) \\
&= \ln \sum_{f\in\mathcal{F}} \sum_{y^n\in\mathcal{Y}^n} f_n(y^n) \\
&= \ln N.
\end{align*}


Of course, we already know this. In fact, the mixture forecaster described in Section 9.2
achieves the same bound. In Exercise 9.3 we point out that this upper bound cannot be
improved in the sense that there exist classes of $N$ experts such that the minimax regret
equals $\ln N$. With the notation introduced in Section 2.10, $V_n^{(N)} = \ln N$ (if $n \ge \log_2 N$).

The mixture forecaster has obvious computational advantages over the normalized max- 
imum likelihood forecaster and does not suffer from the “horizon-dependence” of the latter 
mentioned earlier. These bounds suggest that one does not lose much by using the simple 
uniform mixture forecaster instead of the optimal but horizon-dependent forecaster. We 
will see later that this fact remains true in more general settings, even for certain infinite 
classes of experts. However, as it is pointed out in Exercise 9.7, even if F is finite, in 
some cases $V_n(\mathcal{F})$ may be significantly smaller than the worst-case loss achievable by any
mixture forecaster. 


Constant Experts 

Next we consider the class $\mathcal{F}$ of all experts such that $f_t(j \mid y^{t-1}) = f(j)$ (with $f(j) \ge 0$,
$\sum_{j=1}^m f(j) = 1$) for each $f \in \mathcal{F}$ and independently of $t$ and $y^{t-1}$. In other words, $\mathcal{F}$
contains all forecasters $f$ whose associated probability distribution over $\mathcal{Y}^n$ is a product
distribution with identical components. Here we only consider the case when $m = 2$.
This simplifies the notation and the calculations while all the main ideas remain present.
Generalization to $m > 2$ is straightforward, and we leave the calculations as exercises.

If $m = 2$ (i.e., $\mathcal{Y} = \{1, 2\}$ and $\mathcal{D} = \{(q, 1 - q) \in \mathbb{R}^2 : q \in [0, 1]\}$), each expert in
$\mathcal{F}$ may be identified with a number $q \in [0, 1]$ representing $q = f(1)$. Thus, this expert
predicts, at each time $t$, according to the vector $(q, 1 - q) \in \mathcal{D}$, regardless of $t$ and the past
outcomes $y_1, \ldots, y_{t-1}$. We call this class the class of constant experts. Next we determine
the asymptotic value of the minimax regret $V_n(\mathcal{F})$ for this class.


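Theorem 9.2. The minimax regret $V_n(\mathcal{F})$ of the class $\mathcal{F}$ of all constant experts over the
alphabet $\mathcal{Y} = \{1, 2\}$ defined above satisfies

\[
V_n(\mathcal{F}) = \frac{1}{2} \ln n + \frac{1}{2} \ln \frac{\pi}{2} + \varepsilon_n ,
\]

where $\varepsilon_n \to 0$ as $n \to \infty$.

Proof. Recall that by Theorem 9.1,

\[
V_n(\mathcal{F}) = \ln \sum_{y^n \in \mathcal{Y}^n} \sup_{f \in \mathcal{F}} f_n(y^n) .
\]

Now assume that the number of 1's in the sequence $y^n \in \{1, 2\}^n$ is $n_1$ and the number of
2's is $n_2 = n - n_1$. Then for the expert $f$ that predicts according to $(q, 1 - q)$, we have
$f_n(y^n) = q^{n_1}(1 - q)^{n_2}$.

Then it is easy to see, for example, by differentiating the logarithm of the above expression,
that this is maximized for $q = n_1/n$, and therefore

\[
\sup_{f \in \mathcal{F}} f_n(y^n) = \left(\frac{n_1}{n}\right)^{n_1} \left(\frac{n_2}{n}\right)^{n_2} .
\]

Since there are $\binom{n}{n_1}$ sequences containing exactly $n_1$ 1's, we have

\[
V_n(\mathcal{F}) = \ln \sum_{n_1=0}^{n} \binom{n}{n_1} \left(\frac{n_1}{n}\right)^{n_1} \left(\frac{n_2}{n}\right)^{n_2} .
\]

We show the proof of the upper bound $V_n(\mathcal{F}) \le \frac{1}{2} \ln n + \frac{1}{2} \ln \frac{\pi}{2} + o(1)$, whereas the similar
proof of the lower bound is left as an exercise. Recall Stirling's formula

\[
\sqrt{2\pi n} \left(\frac{n}{e}\right)^n e^{1/(12n+1)} < n! < \sqrt{2\pi n} \left(\frac{n}{e}\right)^n e^{1/(12n)}
\]

(see, e.g., Feller [96]). Using this to approximate the binomial coefficients, each term of
the sum with $1 \le n_1 \le n - 1$ may be bounded as

\[
\binom{n}{n_1} \left(\frac{n_1}{n}\right)^{n_1} \left(\frac{n_2}{n}\right)^{n_2} \le \frac{1}{\sqrt{2\pi}} \sqrt{\frac{n}{n_1 n_2}} \, e^{1/(12n)} .
\]

Thus, since the two terms with $n_1 = 0$ and $n_1 = n$ each equal 1,

\[
V_n(\mathcal{F}) \le \ln \left( 2 + e^{1/(12n)} \sqrt{\frac{n}{2\pi}} \sum_{n_1=1}^{n-1} \frac{1}{\sqrt{n_1 n_2}} \right) .
\]

Writing the last sum in the expression as

\[
\sum_{n_1=1}^{n-1} \frac{1}{\sqrt{n_1 n_2}} = \sum_{n_1=1}^{n-1} \frac{1}{n} \cdot \frac{1}{\sqrt{\frac{n_1}{n}\left(1 - \frac{n_1}{n}\right)}} ,
\]

we notice that it is just a Riemann approximation of the integral

\[
\int_0^1 \frac{1}{\sqrt{x(1 - x)}} \, dx = \pi
\]

(see the exercises). This implies that $\lim_{n \to \infty} \sum_{n_1=1}^{n-1} 1/\sqrt{n_1 n_2} = \pi$, and so

\[
V_n(\mathcal{F}) \le \ln \left( \sqrt{\frac{n\pi}{2}} \, (1 + o(1)) \right) = \frac{1}{2} \ln n + \frac{1}{2} \ln \frac{\pi}{2} + o(1)
\]

as desired. ∎

Remark 9.2 ($m$-ary alphabet). In the general case, when $m \ge 2$ is a positive integer,
Theorem 9.2 becomes

\[
V_n(\mathcal{F}) = \frac{m-1}{2} \ln \frac{n}{2\pi} + \ln \frac{\Gamma(1/2)^m}{\Gamma(m/2)} + o(1),
\]

where $\Gamma$ denotes the Gamma function (see Exercise 9.8). The fact that the minimax regret
grows as a constant times $\ln n$, the constant being half the number of “free parameters,”
is a general phenomenon. In Section 9.9 we discuss the class of Markov experts, a
generalization of the class of constant experts discussed here, that also obeys this formula.
We study $V_n(\mathcal{F})$ in a much more general setting in Section 9.10. In particular, Corollary 9.1
in Section 9.11 establishes a result showing that, under general conditions, $V_n(\mathcal{F})$ behaves
like $\frac{k}{2} \ln n$, where $k$ is the “dimension” of class $\mathcal{F}$.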
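
The exact expression for $V_n(\mathcal{F})$ obtained in the proof of Theorem 9.2, a sum of $n + 1$ terms, is easy to evaluate numerically. The sketch below (function names are illustrative assumptions) computes it through log-gamma functions to avoid overflow and compares it with the asymptotic value $\frac{1}{2} \ln n + \frac{1}{2} \ln \frac{\pi}{2}$.

```python
import math

def minimax_regret_constant_experts(n):
    """ln sum_{n1=0}^{n} C(n, n1) (n1/n)^n1 (n2/n)^n2, with 0^0 read as 1."""
    total = 0.0
    for n1 in range(n + 1):
        n2 = n - n1
        # log of the binomial coefficient C(n, n1)
        log_term = (math.lgamma(n + 1) - math.lgamma(n1 + 1)
                    - math.lgamma(n2 + 1))
        if n1 > 0:
            log_term += n1 * math.log(n1 / n)
        if n2 > 0:
            log_term += n2 * math.log(n2 / n)
        total += math.exp(log_term)
    return math.log(total)

def asymptotic_value(n):
    """Leading terms of Theorem 9.2, without the vanishing remainder."""
    return 0.5 * math.log(n) + 0.5 * math.log(math.pi / 2)

# Difference between the exact value and the asymptotic formula.
gap_small = minimax_regret_constant_experts(100) - asymptotic_value(100)
gap_large = minimax_regret_constant_experts(10000) - asymptotic_value(10000)
```

In this experiment the gap is positive and shrinks as $n$ grows, in line with $\varepsilon_n \to 0$.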


9.6 The Laplace Mixture 


The purpose of this section is to introduce the idea of mixture forecasters discussed in 
Section 9.2 to certain uncountably infinite classes of experts. In Section 3.3 we have 
already extended the exponentially weighted average forecaster over the convex hull of 
a finite set of experts for general exp-concave loss functions. Special properties of the 
logarithmic loss allow us to derive sharp bounds and to bring the mixture forecaster into a 
particularly simple form. 

For simplicity we show the idea for the class of all constant experts introduced in
Section 9.5. Once again, we gain simplicity by considering only the case of $m = 2$.
That is, the outcome space is $\mathcal{Y} = \{1, 2\}$ and $\mathcal{D} = \{(q, 1 - q) \in \mathbb{R}^2 : q \in [0, 1]\}$, so that
each expert in $\mathcal{F}$ predicts, at each time $t$, according to the vector $(q, 1 - q) \in \mathcal{D}$ regardless
of $t$ and the past outcomes $y_1, \ldots, y_{t-1}$. Theorem 9.2 shows that
$V_n(\mathcal{F}) \approx \frac{1}{2} \ln n + \frac{1}{2} \ln \frac{\pi}{2}$.

In Section 9.5 we pointed out that, for finite classes of experts, the exponentially weighted
average forecaster assigns, to each sequence $y^n$, the average of the probabilities assigned
by each expert; that is, $\hat{p}_n(y^n) = \frac{1}{N} \sum_{j=1}^{N} f_n^{(j)}(y^n)$.

This idea may be generalized in a natural way to the class of constant experts. As before,
let $n_1$ and $n_2$ denote the number of 1's and 2's in a sequence $y^n$. Then the probability
assigned to such a sequence by any expert in the class has the form $q^{n_1}(1 - q)^{n_2}$. The
Laplace mixture of these experts is defined as the forecaster that assigns, to any $y^n \in \{1, 2\}^n$,
the average of all these probabilities according to the uniform distribution over the class $\mathcal{F}$;
that is,

\[
\hat{p}_n(y^n) = \int_0^1 q^{n_1}(1 - q)^{n_2} \, dq .
\]

After calculating this integral, it will be easy to understand the behavior of the forecaster.


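Lemma 9.1.

\[
\int_0^1 q^{n_1}(1 - q)^{n_2} \, dq = \frac{1}{(n + 1)\binom{n}{n_1}} .
\]

Proof. We may prove the equality by a backward induction with respect to $n_1$. If $n_1 = n$,
we clearly have $\int_0^1 q^n \, dq = 1/(n + 1)$. On the other hand, assuming

\[
\int_0^1 q^{n_1+1}(1 - q)^{n-n_1-1} \, dq = \frac{1}{(n + 1)\binom{n}{n_1+1}}
\]

and integrating by parts, we obtain

\[
\begin{aligned}
\int_0^1 q^{n_1}(1 - q)^{n-n_1} \, dq &= \frac{n - n_1}{n_1 + 1} \int_0^1 q^{n_1+1}(1 - q)^{n-n_1-1} \, dq \\
&= \frac{n - n_1}{n_1 + 1} \cdot \frac{1}{(n + 1)\binom{n}{n_1+1}} \\
&= \frac{1}{(n + 1)\binom{n}{n_1}} . \qquad ∎
\end{aligned}
\]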


The first thing we observe is that the actual predictions of the Laplace mixture forecaster
can be calculated very easily and have a natural interpretation. Assume that the number of
1's and 2's in the past outcome sequence $y^{t-1}$ are $t_1$ and $t_2$. Then the probability the Laplace
forecaster assigns to the next outcome being 1 is, by Lemma 9.1,

\[
\hat{p}_t(1 \mid y^{t-1}) = \frac{\hat{p}_t(y^{t-1}1)}{\hat{p}_{t-1}(y^{t-1})}
= \frac{1 \big/ \left((t + 1)\binom{t}{t_1+1}\right)}{1 \big/ \left(t\binom{t-1}{t_1}\right)}
= \frac{t_1 + 1}{t + 1} .
\]

Similarly, $\hat{p}_t(2 \mid y^{t-1}) = (t_2 + 1)/(t + 1)$. We may interpret $\hat{p}_t(1 \mid y^{t-1})$ as a slight
modification of the relative frequency $t_1/(t - 1)$. In fact, the Laplace forecaster may be interpreted
as a “smoothed” version of the empirical frequencies. By smoothing, one prevents infinite
losses that may occur if $t_1 = 0$ or $t_2 = 0$. All we need to analyze the cumulative loss of the
Laplace mixture is a simple property of binomial coefficients.

Lemma 9.2. For all $1 \le k \le n$,

\[
\binom{n}{k} \le \frac{1}{\left(\frac{k}{n}\right)^k \left(\frac{n-k}{n}\right)^{n-k}} .
\]

Proof. If the random variables $Y_1, \ldots, Y_n$ are drawn i.i.d. according to the distribution
$\mathbb{P}[Y_i = 1] = 1 - \mathbb{P}[Y_i = 2] = k/n$, then the probability that exactly $k$ of them equal 1
is $\binom{n}{k} \left(\frac{k}{n}\right)^k \left(\frac{n-k}{n}\right)^{n-k}$, and therefore this last expression cannot be larger than 1. ∎
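
The sequential form of the Laplace forecaster can be checked against the closed form of Lemma 9.1. The following sketch uses exact rational arithmetic; the function names and the sample sequence are illustrative assumptions. It multiplies the one-step predictions $(t_i + 1)/(t + 1)$ along a sequence and compares the product with $1/((n + 1)\binom{n}{n_1})$.

```python
from fractions import Fraction
from math import comb

def laplace_prediction(past, symbol):
    """Laplace rule: (occurrences of symbol in past + 1) / (len(past) + 2).
    With len(past) = t - 1 this is (t_i + 1) / (t + 1)."""
    count = sum(1 for s in past if s == symbol)
    return Fraction(count + 1, len(past) + 2)

def laplace_sequence_probability(y):
    """Probability of the whole sequence as a product of one-step predictions."""
    prob = Fraction(1)
    for t in range(len(y)):
        prob *= laplace_prediction(y[:t], y[t])
    return prob

y = (1, 2, 1, 1, 2)                  # arbitrary binary sequence over {1, 2}
n, n1 = len(y), y.count(1)
closed_form = Fraction(1, (n + 1) * comb(n, n1))   # Lemma 9.1
```

Because the mixture probability depends on the sequence only through $(n_1, n_2)$, any reordering of `y` gives the same product.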


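Theorem 9.3. The regret of the Laplace mixture forecaster satisfies

\[
\sup_{y^n \in \{1,2\}^n} \left( \hat{L}(y^n) - \inf_{f \in \mathcal{F}} L_f(y^n) \right) = \ln(n + 1) .
\]

Proof. Let $n_1$ and $n_2$ denote the number of 1's and 2's in the sequence $y^n$. We have already
observed in the proof of Theorem 9.2 that

\[
\sup_{q \in [0,1]} q^{n_1}(1 - q)^{n_2} = \left(\frac{n_1}{n}\right)^{n_1} \left(\frac{n_2}{n}\right)^{n_2} .
\]

Thus, the regret of the forecaster for such a sequence is

\[
\begin{aligned}
\hat{L}(y^n) - \inf_{f \in \mathcal{F}} L_f(y^n) &= \ln \frac{\sup_{q \in [0,1]} q^{n_1}(1 - q)^{n_2}}{\int_0^1 q^{n_1}(1 - q)^{n_2} \, dq} \\
&= \ln \frac{\left(\frac{n_1}{n}\right)^{n_1} \left(\frac{n_2}{n}\right)^{n_2}}{1 \big/ \left((n + 1)\binom{n}{n_1}\right)} && \text{(by Lemma 9.1)} \\
&\le \ln(n + 1) && \text{(by Lemma 9.2).}
\end{aligned}
\]

Equality is achieved for the sequence $y^n = (1, 1, \ldots, 1)$. ∎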


Thus, the extremely simple Laplace mixture forecaster achieves an excess cumulative 
loss that is of the same order of magnitude as that of the minimax optimal forecaster, 
though the leading constant is 1 instead of 1/2. In the next section we show that a slight 
modification of the forecaster achieves this optimal leading constant as well. 


Remark 9.3. Theorem 9.3 can be extended, in a straightforward way, to the general case
of alphabet size $m \ge 2$. In this case,

\[
\sup_{y^n \in \mathcal{Y}^n} \left( \hat{L}(y^n) - \inf_{f \in \mathcal{F}} L_f(y^n) \right) = \ln \binom{n + m - 1}{m - 1} \le (m - 1) \ln(n + 1)
\]

(see Exercise 9.10).


9.7 A Refined Mixture Forecaster 


With a small modification of the Laplace forecaster, we may obtain a mixture forecaster
that achieves a worst-case regret comparable to that of the minimax optimal normalized
maximum likelihood forecaster. The proof of Theorem 9.3 reveals that the Laplace mixture
achieves the largest regret for sequences containing either very few 1's or very few 2's.
Because the optimal forecaster is an equalizer (recall Theorem 9.1), a good forecaster should
attempt to achieve a nearly equal regret for all sequences. This may be done by modifying
the mixture so that it gives a slightly larger weight to those experts that predict well on these
critical sequences. The idea, first suggested by Krichevsky and Trofimov, is to use, instead
of the uniform weighting distribution, the Beta(1/2, 1/2) density $1/\left(\pi\sqrt{q(1 - q)}\right)$. (See
Section A.1.9 for some basic properties of the Beta family of densities.) Thus, we define
the Krichevsky-Trofimov mixture forecaster by

\[
\hat{p}_n(y^n) = \int_0^1 \frac{q^{n_1}(1 - q)^{n_2}}{\pi\sqrt{q(1 - q)}} \, dq ,
\]

where we use the notation of the previous section. It is easy to see, by a recursive argument
similar to the one seen in the proof of Lemma 9.1, that the predictions of the Krichevsky-
Trofimov mixture may be calculated by

\[
\hat{p}_t(1 \mid y^{t-1}) = \frac{t_1 + 1/2}{t} ,
\]

a formula very similar to that obtained in the case of the Laplace forecaster.

The performance of the forecaster is easily bounded once the following lemma is
established.

Lemma 9.3. For all nonnegative integers $n_1, n_2$ with $n = n_1 + n_2 \ge 1$,

\[
\int_0^1 \frac{q^{n_1}(1 - q)^{n_2}}{\pi\sqrt{q(1 - q)}} \, dq \ge \frac{1}{2\sqrt{n}} \left(\frac{n_1}{n}\right)^{n_1} \left(\frac{n_2}{n}\right)^{n_2} .
\]

The proof is left as an exercise. On the basis of this lemma we immediately derive the
following performance bound.


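Theorem 9.4. The regret of the Krichevsky-Trofimov mixture forecaster satisfies

\[
\sup_{y^n \in \{1,2\}^n} \left( \hat{L}(y^n) - \inf_{f \in \mathcal{F}} L_f(y^n) \right) \le \frac{1}{2} \ln n + \ln 2 .
\]

Proof.

\[
\begin{aligned}
\hat{L}(y^n) - \inf_{f \in \mathcal{F}} L_f(y^n) &= \ln \frac{\sup_{q \in [0,1]} q^{n_1}(1 - q)^{n_2}}{\int_0^1 \frac{q^{n_1}(1 - q)^{n_2}}{\pi\sqrt{q(1 - q)}} \, dq} \\
&= \ln \frac{\left(\frac{n_1}{n}\right)^{n_1} \left(\frac{n_2}{n}\right)^{n_2}}{\int_0^1 \frac{q^{n_1}(1 - q)^{n_2}}{\pi\sqrt{q(1 - q)}} \, dq} \\
&\le \ln\left(2\sqrt{n}\right) && \text{(by Lemma 9.3)}
\end{aligned}
\]

as desired. ∎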
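
The worst-case regrets of the Laplace and Krichevsky-Trofimov mixtures can be compared numerically. In the sketch below, the parameter `a` is an assumption introduced only to treat both forecasters with one routine: the sequential rule $(t_i + a)/(t - 1 + 2a)$ gives the Laplace forecaster for $a = 1$ and the Krichevsky-Trofimov forecaster for $a = 1/2$. Since the regret depends on a sequence only through $(n_1, n_2)$, it suffices to scan the $n + 1$ possible values of $n_1$.

```python
import math

def log_best_constant_expert(n1, n2):
    """ln of max_q q^n1 (1-q)^n2 = (n1/n)^n1 (n2/n)^n2, with 0^0 read as 1."""
    n = n1 + n2
    value = 0.0
    if n1 > 0:
        value += n1 * math.log(n1 / n)
    if n2 > 0:
        value += n2 * math.log(n2 / n)
    return value

def log_mixture_probability(n1, n2, a):
    """ln-probability assigned to a sequence with n1 ones followed by n2 twos
    by the sequential rule (count + a) / (seen + 2a).
    a = 1.0 is the Laplace rule, a = 0.5 the Krichevsky-Trofimov rule."""
    log_prob = 0.0
    seen = 0                          # number of outcomes observed so far
    for count in (n1, n2):
        for i in range(count):        # i = earlier occurrences of this symbol
            log_prob += math.log((i + a) / (seen + 2 * a))
            seen += 1
    return log_prob

def worst_case_regret(n, a):
    """Maximum over all types (n1, n - n1) of the regret of the mixture."""
    return max(log_best_constant_expert(n1, n - n1)
               - log_mixture_probability(n1, n - n1, a)
               for n1 in range(n + 1))
```

For example, `worst_case_regret(50, 1.0)` reproduces $\ln 51$ from Theorem 9.3, while `worst_case_regret(50, 0.5)` stays below $\frac{1}{2} \ln 50 + \ln 2$ as Theorem 9.4 asserts.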


Remark 9.4. The Krichevsky-Trofimov mixture estimate may be generalized to the class
of all constant experts when the outcome space is $\mathcal{Y} = \{1, \ldots, m\}$, where $m \ge 2$ is an
arbitrary integer. In this case the mixture is calculated with respect to the so-called
Dirichlet$(1/2, \ldots, 1/2)$ density

\[
\phi(p) = \frac{\Gamma(m/2)}{\Gamma(1/2)^m} \prod_{j=1}^{m} \frac{1}{\sqrt{p(j)}}
\]

over the probability simplex $\mathcal{D}$, containing all vectors $p = (p(1), \ldots, p(m)) \in \mathbb{R}^m$ with
nonnegative components and adding up to 1. As for the bound of Theorem 9.4, one may
show that the worst-case regret of the obtained forecaster

\[
\hat{p}_n(y^n) = \int_{\mathcal{D}} \prod_{j=1}^{m} p(j)^{n_j} \, \phi(p) \, dp
\]

(where $n_1, \ldots, n_m$ denote the number of occurrences of each symbol in the string $y^n$) is
upper bounded by

\[
\frac{m-1}{2} \ln \frac{n}{2\pi} + \ln \frac{\Gamma(1/2)^m}{\Gamma(m/2)} + \frac{m-1}{2} \ln 2 + o(1) .
\]

This upper bound exceeds the minimax regret by just a constant $\frac{m-1}{2} \ln 2$. Moreover, the
forecaster may easily be calculated by the simple formula

\[
\hat{p}_t(i \mid y^{t-1}) = \frac{t_i + 1/2}{t - 1 + m/2} , \qquad i = 1, \ldots, m,
\]

where $t_i$ denotes the number of occurrences of $i$ in $y^{t-1}$.

To understand the behavior of the Krichevsky-Trofimov forecaster, we derive a lower
bound for its regret that holds for every sequence $y^n$ of outcomes. The following result
shows that this mixture forecaster is indeed an approximate equalizer, since no matter what
the sequence $y^n$ is, its regret differs from $\frac{1}{2} \ln n$ by at most a constant. (Recall that the
minimax optimal forecaster is an exact equalizer.)


Theorem 9.5. For all outcome sequences $y^n \in \{1, 2\}^n$, the regret of the Krichevsky-
Trofimov mixture forecaster satisfies

\[
\hat{L}(y^n) - \inf_{f \in \mathcal{F}} L_f(y^n) = \frac{1}{2} \ln n + \Theta(1) .
\]

Proof. Fix an outcome sequence $y^n$. As before, let $n_1$ and $n_2$ denote the number of
occurrences of 1's and 2's in $y^n$. It suffices to derive a lower bound for the ratio of
$\max_{q \in [0,1]} q^{n_1}(1 - q)^{n_2} = (n_1/n)^{n_1}(n_2/n)^{n_2}$ and the Krichevsky-Trofimov mixture
probability $\hat{p}_n(y^n)$. To this end, observe that this probability may be expressed in terms of the
gamma function as

\[
\hat{p}_n(y^n) = \frac{\Gamma\left(n_1 + \frac{1}{2}\right) \Gamma\left(n_2 + \frac{1}{2}\right)}{\pi \, n!}
\]

(see Section A.1.9). Thus,

\[
\hat{L}(y^n) - \inf_{f \in \mathcal{F}} L_f(y^n) - \frac{1}{2} \ln n
= \ln \frac{\pi \, n! \, n_1^{n_1} n_2^{n_2}}{\Gamma\left(n_1 + \frac{1}{2}\right) \Gamma\left(n_2 + \frac{1}{2}\right) n^n \sqrt{n}} .
\]

In order to investigate this quantity, introduce the function

\[
F(n_1, n_2) = \frac{\pi \, (n_1 + n_2)! \, n_1^{n_1} n_2^{n_2}}{\Gamma\left(n_1 + \frac{1}{2}\right) \Gamma\left(n_2 + \frac{1}{2}\right) (n_1 + n_2)^{n_1+n_2} \sqrt{n_1 + n_2}}
\tag{9.1}
\]

defined for all pairs of positive integers $n_1, n_2$. A straightforward calculation, left as an
exercise, shows that $F$ is decreasing in both of its arguments. Hence, $F(n_1, n_2) \ge F(n, n)$,
and therefore

\[
\hat{L}(y^n) - \inf_{f \in \mathcal{F}} L_f(y^n) - \frac{1}{2} \ln n
\ge \ln \frac{\pi \, (2n)! \, n^{2n}}{\Gamma\left(n + \frac{1}{2}\right)^2 (2n)^{2n} \sqrt{2n}} .
\]

Using Stirling's approximation $\Gamma(x) = \sqrt{2\pi/x} \, (x/e)^x (1 + o(1))$ as $x \to \infty$, it is easy
to see that the right-hand side converges to a positive constant as $n \to \infty$. ∎
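
The behavior of the function $F$ of (9.1) can be explored numerically. The sketch below (the name `log_F` is an illustrative assumption) evaluates $\ln F(n_1, n_2)$ through log-gamma functions. A short Stirling computation, carried out here rather than quoted from the text, suggests that $\ln F(n, n)$ tends to $\frac{1}{2} \ln \frac{\pi}{2}$, and the code compares against this value.

```python
import math

def log_F(n1, n2):
    """ln F(n1, n2) for positive integers n1, n2, with
    F = pi (n1+n2)! n1^n1 n2^n2
        / (Gamma(n1+1/2) Gamma(n2+1/2) (n1+n2)^(n1+n2) sqrt(n1+n2))."""
    n = n1 + n2
    return (math.log(math.pi) + math.lgamma(n + 1)
            + n1 * math.log(n1) + n2 * math.log(n2)
            - math.lgamma(n1 + 0.5) - math.lgamma(n2 + 0.5)
            - n * math.log(n) - 0.5 * math.log(n))

# Conjectured limit of ln F(n, n), obtained from Stirling's approximation.
limit = 0.5 * math.log(math.pi / 2)

values = [log_F(n, n) for n in (1, 10, 100, 10**6)]
```

On these inputs the values decrease toward `limit`, consistent with the monotonicity of $F$ used in the proof.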




9.8 Lower Bounds for Most Sequences 


In Section 9.4 we determined the minimax regret $V_n(\mathcal{F})$ exactly for any arbitrary class $\mathcal{F}$ of
experts. The definition of $V_n(\mathcal{F})$ implies that for any forecaster $\hat{p}_n$ there exists a sequence
$y^n \in \mathcal{Y}^n$ such that the regret $\hat{L}(y^n) - \inf_{f \in \mathcal{F}} L_f(y^n)$ is at least as large as $V_n(\mathcal{F})$. In this
section we point out that in some cases one may obtain much stronger lower bounds. In
fact, we show that for some classes $\mathcal{F}$, no matter what the forecaster $\hat{p}_n$ is, the regret cannot
be much smaller than $V_n(\mathcal{F})$ for “most” sequences of outcomes $y^n$. This indicates that the
minimax value is achieved not only for an exceptional unfortunate sequence of outcomes,
but in fact for “most” of them. (Of course, since the minimax optimal forecaster is an
equalizer, we already knew this for $p^*$. The result shown in this section indicates that all
forecasters share this property.) What we mean by “most” will be made clear later. We just
note here that the word “most” does not directly refer to cardinality.

In this section, just like in the previous ones, we focus on the special case of binary 
alphabet Y = {1, 2} and the class F of constant experts. This is the simplest case for which 
the basic ideas may be seen in a transparent way. The result obtained for this simple class 
may be generalized to more complex cases such as constant experts over an m-ary alphabet, 
Markov experts, and classes of experts based on finite state machines. Some of these cases 
are left to the reader as exercises. 

Thus, the class $\mathcal{F}$ we consider contains all probability distributions on $\{1, 2\}^n$ that assign,
to any sequence $y^n \in \{1, 2\}^n$, probability $q^j(1 - q)^{n-j}$, where $j$ is the number of 1's in
the sequence and $q \in [0, 1]$ is the parameter determining the expert. Recall that for such a
sequence the best expert assigns probability

\[
\max_{q \in [0,1]} q^j(1 - q)^{n-j} = \left(\frac{j}{n}\right)^j \left(\frac{n-j}{n}\right)^{n-j} ,
\]

and therefore $\inf_{f \in \mathcal{F}} L_f(y^n) = -j \ln \frac{j}{n} - (n - j) \ln \frac{n-j}{n}$.
Next we formulate the result. To this end, we partition the set $\{1, 2\}^n$ into $n + 1$ classes
of types according to the number of 1's contained in the sequence. To this purpose, define
the sets

\[
T_j = \left\{ y^n \in \{1, 2\}^n : \text{the number of 1's in } y^n \text{ is exactly } j \right\}, \qquad j = 0, 1, \ldots, n.
\]

In Theorem 9.6 think of $\delta_n$ as a small positive number, and $\varepsilon_n \ll \delta_n$ even smaller but
still not too small. For example, one may consider $\delta_n \sim 1/\ln\ln n$ and $\varepsilon_n \sim 1/\ln n$ or
$\delta_n \sim 1/\sqrt{\ln\ln n}$ and $\varepsilon_n \sim 1/\ln\ln n$ to get a meaningful result.


Theorem 9.6. Consider the class $\mathcal{F}$ of constant experts over the binary alphabet $\mathcal{Y} = \{1, 2\}$.
Let $\varepsilon_n$ be a positive number, and consider an arbitrary forecaster $\hat{p}_n$. Define the set

\[
A = \left\{ y^n \in \{1, 2\}^n : \hat{L}(y^n) \le \inf_{f \in \mathcal{F}} L_f(y^n) + \frac{1}{2} \ln n - \ln \frac{C}{\varepsilon_n} \right\} ,
\]

where $C = \sqrt{\pi} \, e^{1/6} / \sqrt{2} \approx 1.4806$. Then, for any $\delta_n > 0$,

\[
\frac{1}{n} \left| \left\{ j : \frac{|A \cap T_j|}{|T_j|} > \delta_n \right\} \right| \le \frac{\varepsilon_n}{\delta_n} .
\]

If $\varepsilon_n$ is small, the set $A$ contains all sequences for which the regret of the forecaster
$\hat{p}_n$ is significantly smaller than the minimax value $V_n(\mathcal{F}) \approx \frac{1}{2} \ln n$ (see Theorem 9.2).
Theorem 9.6 states that if $\varepsilon_n \ll \delta_n$, the vast majority of types $T_j$ are such that the proportion
of sequences in $T_j$ falling in $A$ is smaller than $\delta_n$.


Remark 9.5. The interpretation we have given to Theorem 9.6 is that for any forecaster, the
regret cannot be significantly smaller than the minimax value for “most” sequences. What
the theorem really means is that for the majority of classes $T_j$ the subset of sequences of $T_j$
for which the regret is small is tiny. However, the theorem does not imply that there exist few
sequences in $\{1, 2\}^n$ for which the regret is small. Just note that a relatively small number
of classes of types (say, those with $j$ between $n/2 - 10\sqrt{n}$ and $n/2 + 10\sqrt{n}$) contain the
vast majority of sequences. Indeed, the forecaster that assigns the uniform probability $2^{-n}$
to all sequences will work reasonably well for a huge number of sequences. This example
shows that the notion of “most” considered in Theorem 9.6 may be more adequate than just
counting sequences.
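
The uniform forecaster of Remark 9.5 makes a convenient test case for Theorem 9.6. In the sketch below (function names and the chosen values of $n$, $\varepsilon_n$, $\delta_n$ are illustrative assumptions) the regret of the uniform forecaster depends only on the type $T_j$, so the proportion $|A \cap T_j|/|T_j|$ is either 0 or 1, and the left-hand side of the theorem reduces to a count of types.

```python
import math

# Constant of Theorem 9.6: sqrt(pi) * e^(1/6) / sqrt(2).
C = math.sqrt(math.pi) * math.exp(1.0 / 6.0) / math.sqrt(2.0)

def uniform_forecaster_regret(n, j):
    """Regret of the forecaster assigning probability 2^-n to every sequence,
    on any sequence of length n with exactly j ones."""
    log_best = 0.0                     # ln of (j/n)^j ((n-j)/n)^(n-j)
    if j > 0:
        log_best += j * math.log(j / n)
    if j < n:
        log_best += (n - j) * math.log((n - j) / n)
    return n * math.log(2.0) + log_best

def fraction_of_exceptional_types(n, eps, delta):
    """(1/n) * number of types T_j with |A ∩ T_j| / |T_j| > delta."""
    threshold = 0.5 * math.log(n) - math.log(C / eps)
    exceptional = 0
    for j in range(n + 1):
        # All sequences of a type share the same regret, so the proportion
        # of T_j falling in A is 1 if the regret is small and 0 otherwise.
        proportion_in_A = 1.0 if uniform_forecaster_regret(n, j) <= threshold else 0.0
        if proportion_in_A > delta:
            exceptional += 1
    return exceptional / n

fraction = fraction_of_exceptional_types(1000, 0.1, 0.5)
```

With these values only the types with $j$ close to $n/2$ are exceptional, and their fraction stays below $\varepsilon_n/\delta_n = 0.2$, even though those few types contain most sequences.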


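Proof. First observe that the logarithm of the cardinality of each class $T_j$ may be bounded,
using Stirling's formula, by

\[
\begin{aligned}
\ln |T_j| = \ln \binom{n}{j}
&\ge \ln \left( \left(\frac{n}{j}\right)^j \left(\frac{n}{n-j}\right)^{n-j} \frac{\sqrt{n}}{\sqrt{2\pi j(n - j)} \, e^{1/6}} \right) \\
&\ge \ln \left( \left(\frac{n}{j}\right)^j \left(\frac{n}{n-j}\right)^{n-j} \right) - \frac{1}{2} \ln n - \ln C,
\end{aligned}
\]

where at the last step we used the inequality $\sqrt{j(n - j)} \le n/2$. Therefore, for any sequence
$y^n \in T_j$, we have

\[
\inf_{f \in \mathcal{F}} L_f(y^n) \le \ln |T_j| + \frac{1}{2} \ln n + \ln C .
\]

This implies that if $y^n \in A \cap T_j$, then

\[
\hat{L}(y^n) \le \ln |T_j| + \ln n - \ln \frac{1}{\varepsilon_n}
\]

or, in other words,

\[
\hat{p}_n(y^n) \ge \frac{1}{|T_j| \, n \, \varepsilon_n} .
\]

To finish the proof, note that

\[
\begin{aligned}
\frac{1}{n} \left| \left\{ j : \frac{|A \cap T_j|}{|T_j|} > \delta_n \right\} \right|
&\le \frac{1}{n \delta_n} \sum_{j=0}^{n} \frac{|A \cap T_j|}{|T_j|} \\
&= \frac{1}{n \delta_n} \sum_{y^n \in A} \frac{1}{|T(y^n)|} \\
&\le \frac{\varepsilon_n}{\delta_n} \sum_{y^n \in A} \hat{p}_n(y^n) \le \frac{\varepsilon_n}{\delta_n}
\end{aligned}
\]

(where $T(y^n)$ denotes the set $T_j$ containing $y^n$). ∎


9.9 Prediction with Side Information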


The purpose of this section is to extend the framework of prediction when the forecaster 
has access to certain “side information.” At each time instant, before making a prediction, 
a side information symbol is revealed to the forecaster. In our setup this side information 
is completely arbitrary, it may contain any external information and it may even depend 
on the sequence to be predicted. In this section we restrict our attention to the case when 
side information comes from a finite set. In Chapter 11 we develop a general framework of 
prediction with side information when the side information is a finite-dimensional vector, 
and the class of experts contains linear functions of the side information. The formal setup 
is the following. 

We consider prediction of sequences taking values in a finite alphabet $\mathcal{Y} = \{1, \ldots, m\}$.
Let $K$ be a positive integer, and let $\mathcal{G}_1, \ldots, \mathcal{G}_K$ be “base” classes of static forecasters. (We
assume the static property to lighten notation; the definitions and the results that follow can
be generalized easily.) The class $\mathcal{G}_j$ contains forecasters of the form

\[
g^{(j)}(y^n) = \prod_{t=1}^{n} g_t^{(j)}(y_t),
\]

where for each $t = 1, \ldots, n$ the vector $\left(g_t^{(j)}(1), \ldots, g_t^{(j)}(m)\right)$ is an element of the probability
simplex $\mathcal{D}$ in $\mathbb{R}^m$. At each time $t$, a side-information symbol $z_t \in \mathcal{Z} = \{1, \ldots, K\}$ becomes
available to the forecaster. The class $\mathcal{F}$ of forecasters against which our predictor competes
contains all forecasters $f$ of the form

\[
f_t(y \mid y^{t-1}, z^t) = f_t(y \mid z_t) = g^{(z_t)}_{t_{z_t}}(y),
\]

where for each $j$, $t_j$ is the length of the sequence of times $s < t$ such that $z_s = j$. In other
words, each $f$ ignores the past and uses the side-information symbol $z_t$ to pick a class $\mathcal{G}_{z_t}$.
In this class a forecaster is determined on the basis of the subsequence of the past defined
by the time instances when the side information coincided with the actual side information
$z_t$. Note that in some sense $f$ is static, but $z_t$ may depend in an arbitrary manner on the
sequence of past (or even future) outcomes.

The loss of $f \in \mathcal{F}$ for a given sequence of outcomes $y^n \in \mathcal{Y}^n$ and side information
$z^n \in \mathcal{Z}^n$ is

\[
-\sum_{t=1}^{n} \ln f_t(y_t \mid y^{t-1}, z^t) = -\sum_{t=1}^{n} \ln g^{(z_t)}_{t_{z_t}}(y_t) .
\]

Our goal is to define forecasters $\hat{p}_t$ whose cumulative loss

\[
-\sum_{t=1}^{n} \ln \hat{p}_t(y_t \mid y^{t-1}, z^t)
\]

is close to that of the best expert, $\inf_{f \in \mathcal{F}} \left( -\sum_{t=1}^{n} \ln f_t(y_t \mid y^{t-1}, z^t) \right)$, for all outcome sequences
$y^n \in \mathcal{Y}^n$ and side-information sequences $z^n \in \mathcal{Z}^n$. Assume that for each static expert class
$\mathcal{G}_j$ we have a forecaster $q^{(j)}$ with worst-case regret

\[
V_n(q^{(j)}, \mathcal{G}_j) = \sup_{y^n \in \mathcal{Y}^n} \sup_{g \in \mathcal{G}_j} \sum_{t=1}^{n} \ln \frac{g_t(y_t)}{q_t^{(j)}(y_t \mid y^{t-1})} .
\]

On the basis of these “elementary” forecasters, we may define, in a natural way, the
following forecaster $p$ with side information:

\[
p_t(y \mid y^{t-1}, z^t) = q^{(z_t)}_{t_{z_t}}\left(y \mid y_{z_t}^{t}\right),
\]

where for each $j$, $y_j^t$ denotes the sequence of $y_s$'s ($s < t$) such that $z_s = j$. In other words,
the forecaster $p$ looks back at the past sequence $y_1, \ldots, y_{t-1}$ and considers only those
time instants at which the side-information symbol agreed with the current symbol $z_t$. The
prediction of $p$ is just that of $q^{(z_t)}$ based on these past symbols. The performance of $p$ may
be bounded easily as follows.


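Theorem 9.7. For any outcome sequence $y^n \in \mathcal{Y}^n$ and side-information sequence $z^n \in \mathcal{Z}^n$,
the regret of forecaster $p$ with respect to all forecasters in class $\mathcal{F}$ satisfies

\[
\sup_{f \in \mathcal{F}} \sum_{t=1}^{n} \ln \frac{f_t(y_t \mid y^{t-1}, z^t)}{p_t(y_t \mid y^{t-1}, z^t)} \le \sum_{j=1}^{K} V_{n_j}(q^{(j)}, \mathcal{G}_j),
\]

where $n_j = \sum_{t=1}^{n} \mathbb{I}_{\{z_t = j\}}$ is the number of occurrences of symbol $j$ in the side-information
sequence.

Proof.

\[
\begin{aligned}
\sup_{f \in \mathcal{F}} \sum_{t=1}^{n} \ln \frac{f_t(y_t \mid y^{t-1}, z^t)}{p_t(y_t \mid y^{t-1}, z^t)}
&= \sup_{f \in \mathcal{F}} \sum_{j=1}^{K} \sum_{t=1}^{n} \mathbb{I}_{\{z_t = j\}} \ln \frac{g^{(j)}_{t_j}(y_t)}{q^{(j)}_{t_j}(y_t \mid y_j^t)}
&& \text{(by definition of } p\text{)} \\
&= \sum_{j=1}^{K} \sup_{g^{(j)} \in \mathcal{G}_j} \sum_{t=1}^{n} \mathbb{I}_{\{z_t = j\}} \ln \frac{g^{(j)}_{t_j}(y_t)}{q^{(j)}_{t_j}(y_t \mid y_j^t)} \\
&\qquad \text{(since the } j\text{th inner sum depends only on } g^{(j)}\text{)} \\
&\le \sum_{j=1}^{K} V_{n_j}(q^{(j)}, \mathcal{G}_j) . \qquad ∎
\end{aligned}
\]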


The next simple example sheds some light on the power of this simple result. 


Example 9.1 (Markov forecasters). Let $k \ge 1$ and consider the class $\mathcal{M}_k$ of all $k$th order
stationary Markov experts defined over the binary alphabet $\mathcal{Y} = \{1, 2\}$. More precisely,
$\mathcal{M}_k$ contains all forecasters $f$ for which the prediction at time $t$ is a function of the last
$k$ outcomes $(y_{t-k}, \ldots, y_{t-1})$ (independent of $t$ and outcomes $y_s$ for $s < t - k$). In other
words, for a Markov forecaster one may write $f_t(y \mid y^{t-1}) = f(y \mid y_{t-k}^{t-1})$. Such forecasters
are also considered in Section 8.6 for different loss functions. As each prediction of a $k$th
order Markov expert is determined by the last $k$ bits observed, we add a prefix $y_{-k+1}, \ldots, y_0$
to the sequence to predict in the same way as we did in Section 8.6.

To obtain an upper bound for the minimax regret $V_n(\mathcal{M}_k)$, we may use Theorem 9.7
in a simple way. The side information $z_t$ is now defined as $(y_{t-k}, \ldots, y_{t-1})$; that is, $z_t$
takes one of $K = 2^k$ values. If we define $\mathcal{G}_1 = \cdots = \mathcal{G}_K$ as the class $\mathcal{G}$ of all constant
experts over the alphabet $\mathcal{Y} = \{1, 2\}$ defined in the previous sections, then it is easy to
see that the class $\mathcal{F}$ of all forecasters using the side information $(y_{t-k}, \ldots, y_{t-1})$ is just class
$\mathcal{M}_k$. Let $q^{(1)} = \cdots = q^{(K)}$ be the Krichevsky-Trofimov forecaster for class $\mathcal{G}$, and define
the forecaster $p$ as in Theorem 9.7. Then, according to Theorem 9.7, for any sequence of
outcomes $y_{-k+1}^{n}$, the loss of $p$ may be bounded as

\[
\hat{L}(y^n) - \inf_{f \in \mathcal{M}_k} L_f(y^n) \le \sum_{j=1}^{2^k} V_{n_j}(\mathcal{G}) .
\]

According to Theorem 9.4,

\[
V_n(\mathcal{G}) \le \frac{1}{2} \ln n + \ln 2 .
\]

Using this bound, we obtain

\[
\begin{aligned}
\hat{L}(y^n) - \inf_{f \in \mathcal{M}_k} L_f(y^n)
&\le \frac{1}{2} \sum_{j=1}^{2^k} \ln n_j + 2^k \ln 2 \\
&= \frac{2^k}{2} \ln \left( \prod_{j=1}^{2^k} n_j \right)^{1/2^k} + 2^k \ln 2 \\
&\le \frac{2^k}{2} \ln \frac{n}{2^k} + 2^k \ln 2,
\end{aligned}
\]

where we used the arithmetic-geometric mean inequality.
The upper bound obtained this way can be shown to be quite sharp. Indeed, the minimax
regret $V_n(\mathcal{M}_k)$ may be seen to behave like $2^{k-1} \ln(n/2^k)$ (see Exercise 9.13).
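
The construction of Example 9.1 can be sketched in a few lines: one Krichevsky-Trofimov estimator is kept per context $(y_{t-k}, \ldots, y_{t-1})$, and its regret against the best $k$th order Markov expert is compared with the per-context bound $\sum_j \left(\frac{1}{2} \ln n_j + \ln 2\right)$ taken over the visited contexts. The function name, the all-ones prefix, and the pseudo-random test sequence are illustrative assumptions.

```python
import math
import random
from collections import defaultdict

def markov_kt_regret(y, prefix, k):
    """Regret of the context-wise KT forecaster against the best k-th order
    Markov expert, together with the sum of per-context bounds of Theorem 9.4."""
    history = list(prefix)                        # the k symbols y_{-k+1..0}
    counts = defaultdict(lambda: {1: 0, 2: 0})    # per-context symbol counts
    kt_loss = 0.0
    for symbol in y:
        context = tuple(history[-k:]) if k > 0 else ()
        c = counts[context]
        seen = c[1] + c[2]
        # KT prediction inside this context: (count + 1/2) / (seen + 1)
        kt_loss -= math.log((c[symbol] + 0.5) / (seen + 1))
        c[symbol] += 1
        history.append(symbol)

    best_loss = 0.0
    bound = 0.0
    for c in counts.values():
        n_j = c[1] + c[2]
        for s in (1, 2):
            if c[s] > 0:                          # empirical frequencies are optimal
                best_loss -= c[s] * math.log(c[s] / n_j)
        bound += 0.5 * math.log(n_j) + math.log(2)
    return kt_loss - best_loss, bound

rng = random.Random(0)
k = 2
y = [rng.choice((1, 2)) for _ in range(500)]
regret, per_context_bound = markov_kt_regret(y, [1] * k, k)
overall_bound = 2 ** (k - 1) * math.log(len(y) / 2 ** k) + 2 ** k * math.log(2)
```

The per-context bound is never larger than the final bound of the example, which follows from the arithmetic-geometric mean step.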


9.10 A General Upper Bound 


Next we investigate the minimax regret $V_n(\mathcal{F})$ for general classes of experts. We derive a
general bound that shows how the “size” of the class $\mathcal{F}$ affects the cumulative regret.
To any class $\mathcal{F}$ of experts, we associate the metric $d$ defined as

\[
d(f, g) = \left( \sum_{t=1}^{n} \sup_{y^t} \left( \ln f_t(y_t \mid y^{t-1}) - \ln g_t(y_t \mid y^{t-1}) \right)^2 \right)^{1/2} .
\]

Denote by $N(\mathcal{F}, \varepsilon)$ the $\varepsilon$-covering number of $\mathcal{F}$ under the metric $d$. Recall that for any
$\varepsilon > 0$, the $\varepsilon$-covering number is the cardinality of the smallest subset $\mathcal{F}' \subset \mathcal{F}$ such that for
all $f \in \mathcal{F}$ there exists a $g \in \mathcal{F}'$ such that $d(f, g) \le \varepsilon$. The main result of this section is the
following upper bound.

Theorem 9.8. For any class $\mathcal{F}$ of experts,

\[
V_n(\mathcal{F}) \le \inf_{\varepsilon > 0} \left( \ln N(\mathcal{F}, \varepsilon) + 24 \int_0^{\varepsilon} \sqrt{\ln N(\mathcal{F}, \delta)} \, d\delta \right) .
\]

Note that if $\mathcal{F}$ is a finite class, then the right-hand side converges to $\ln |\mathcal{F}|$ as $\varepsilon \to 0$,
and therefore we recover our earlier general bound. However, even for finite classes, the
right-hand side may be significantly smaller than $\ln |\mathcal{F}|$. Also, this result allows us to derive
upper bounds for very general classes of experts.

As a first step in the proof of Theorem 9.8, we obtain a weak bound for $V_n(\mathcal{F})$. This will
be later refined to prove the stronger bound of Theorem 9.8.


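Lemma 9.4. For any class $\mathcal{F}$ of experts,

\[
V_n(\mathcal{F}) \le 24 \int_0^{D/2} \sqrt{\ln N(\mathcal{F}, \varepsilon)} \, d\varepsilon,
\]

where $D = \sup_{f, g \in \mathcal{F}} d(f, g)$ is the diameter of $\mathcal{F}$.

Proof. Recall that, by the equalizer property of the normalized maximum likelihood
forecaster $p^*$ established in Theorem 9.1, for all $y^n \in \mathcal{Y}^n$, $V_n(\mathcal{F}) = \sup_{f \in \mathcal{F}} \ln \frac{f_n(y^n)}{p_n^*(y^n)}$. Because
the right-hand side is a constant function of $y^n$, we may take any weighted average of it
without changing its value. The trick is to weight the average according to the probability
distribution defined by $p^*$. Thus, we may write

\[
V_n(\mathcal{F}) = \sum_{y^n \in \mathcal{Y}^n} \left( \sup_{f \in \mathcal{F}} \ln \frac{f_n(y^n)}{p_n^*(y^n)} \right) p_n^*(y^n) .
\]

If we introduce a vector $Y^n = (Y_1, \ldots, Y_n)$ of random variables distributed according to
$p_n^*$, we obtain

\[
\begin{aligned}
V_n(\mathcal{F}) &= \mathbb{E} \left[ \sup_{f \in \mathcal{F}} \ln \frac{f_n(Y^n)}{p_n^*(Y^n)} \right] \\
&= \mathbb{E} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{n} \ln \frac{f_t(Y_t \mid Y^{t-1})}{p_t^*(Y_t \mid Y^{t-1})} \right] \\
&\le \mathbb{E} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{n} \left( \ln \frac{f_t(Y_t \mid Y^{t-1})}{p_t^*(Y_t \mid Y^{t-1})} - \mathbb{E} \left[ \ln \frac{f_t(Y_t \mid Y^{t-1})}{p_t^*(Y_t \mid Y^{t-1})} \,\Big|\, Y^{t-1} \right] \right) \right],
\end{aligned}
\]

where the last step follows from the nonnegativity of the Kullback-Leibler divergence of
the conditional densities, that is, from the fact that

\[
\mathbb{E} \left[ \ln \frac{p_t^*(Y_t \mid Y^{t-1})}{f_t(Y_t \mid Y^{t-1})} \,\Big|\, Y^{t-1} = y^{t-1} \right] \ge 0
\]

(see Section A.2). Now, for each $f \in \mathcal{F}$ let

\[
T_f(y^n) = \frac{1}{2} \sum_{t=1}^{n} \left( \ln \frac{f_t(y_t \mid y^{t-1})}{p_t^*(y_t \mid y^{t-1})} - \mathbb{E} \left[ \ln \frac{f_t(Y_t \mid Y^{t-1})}{p_t^*(Y_t \mid Y^{t-1})} \,\Big|\, Y^{t-1} = y^{t-1} \right] \right)
\]

so we have $V_n(\mathcal{F}) \le 2 \, \mathbb{E} \left[ \sup_{f \in \mathcal{F}} T_f \right]$, where we write $T_f = T_f(Y^n)$.

To obtain a suitable upper bound for this quantity, we apply Theorem 8.3. To do this,
we need to show that the process $\{T_f : f \in \mathcal{F}\}$ is indeed a subgaussian family under
the metric $d$. (Sample continuity of the process is obvious.) To this end, note that for any
$f, g \in \mathcal{F}$,

\[
T_f(Y^n) - T_g(Y^n) = \sum_{t=1}^{n} Z_t(Y^t),
\]

where

\[
Z_t(y^t) = \frac{1}{2} \left( \ln \frac{f_t(y_t \mid y^{t-1})}{g_t(y_t \mid y^{t-1})} - \mathbb{E} \left[ \ln \frac{f_t(Y_t \mid Y^{t-1})}{g_t(Y_t \mid Y^{t-1})} \,\Big|\, Y^{t-1} = y^{t-1} \right] \right) .
\]

Now it is easy to see that $T_f - T_g = T_f(Y^n) - T_g(Y^n)$ is a sum of bounded martingale
differences with respect to the sequence $Y_1, Y_2, \ldots, Y_n$; that is, each term $Z_t$ has zero
conditional mean and range bounded by $2 d_t(f, g)$, where
$d_t(f, g) = \sup_{y^t} \left| \ln f_t(y_t \mid y^{t-1}) - \ln g_t(y_t \mid y^{t-1}) \right|$, so that $d(f, g)^2 = \sum_{t=1}^{n} d_t(f, g)^2$.
Then Lemma A.6 implies that, for all $\lambda > 0$,

\[
\mathbb{E} \left[ e^{\lambda (T_f - T_g)} \right] \le \exp \left( \frac{\lambda^2}{2} \, d(f, g)^2 \right) .
\]

Thus, the family $\{T_f : f \in \mathcal{F}\}$ is indeed subgaussian. Hence, using
$V_n(\mathcal{F}) \le 2 \, \mathbb{E} \left[ \sup_{f \in \mathcal{F}} T_f \right]$ and applying Theorem 8.3 we obtain the statement of the lemma. ∎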


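Proof of Theorem 9.8. To prove the main inequality, we partition $\mathcal{F}$ into small subclasses
and calculate the minimax forecaster for each subclass. Lemma 9.4 is then applied in each
subclass. Finally, the optimal forecasters for these subclasses are combined by a simple
finite mixture.

Fix an arbitrary $\varepsilon > 0$ and let $\mathcal{G} = \{g_1, \ldots, g_N\}$ be an $\varepsilon$-covering of $\mathcal{F}$ of minimal size
$N = N(\mathcal{F}, \varepsilon)$. Determine the subsets $\mathcal{F}_1, \ldots, \mathcal{F}_N$ of $\mathcal{F}$ by

\[
\mathcal{F}_i = \left\{ f \in \mathcal{F} : d(f, g_i) \le d(f, g_j) \text{ for all } j = 1, \ldots, N \right\};
\]

that is, $\mathcal{F}_i$ contains all experts that are closest to $g_i$ in the covering. Clearly, the union
of $\mathcal{F}_1, \ldots, \mathcal{F}_N$ is $\mathcal{F}$. For each $i = 1, \ldots, N$, let $g^{(i)}$ denote the normalized maximum
likelihood forecaster for $\mathcal{F}_i$:

\[
g_n^{(i)}(y^n) = \frac{\sup_{f \in \mathcal{F}_i} f_n(y^n)}{\sum_{x^n \in \mathcal{Y}^n} \sup_{f \in \mathcal{F}_i} f_n(x^n)} .
\]

Now let the forecaster $p_\varepsilon$ be the uniform mixture of “experts” $g^{(1)}, \ldots, g^{(N)}$. Clearly,
$V_n(\mathcal{F}) \le \inf_{\varepsilon > 0} V_n(p_\varepsilon, \mathcal{F})$. Thus, all we have to do is to bound the regret of $p_\varepsilon$. To this end,
fix any $y^n \in \mathcal{Y}^n$ and let $k = k(y^n)$ be the index of the subset $\mathcal{F}_k$ containing the best expert
for sequence $y^n$; that is,

\[
\ln \sup_{f \in \mathcal{F}} f_n(y^n) = \ln \sup_{f \in \mathcal{F}_k} f_n(y^n) .
\]

Then

\[
\ln \frac{\sup_{f \in \mathcal{F}} f_n(y^n)}{p_\varepsilon(y^n)} = \ln \frac{g_n^{(k)}(y^n)}{p_\varepsilon(y^n)} + \ln \frac{\sup_{f \in \mathcal{F}_k} f_n(y^n)}{g_n^{(k)}(y^n)} .
\]

On the one hand, by the upper bound for the loss of the mixture forecaster,

\[
\sup_{y^n} \ln \frac{g_n^{(k)}(y^n)}{p_\varepsilon(y^n)} \le \ln N .
\]

On the other hand,

\[
\sup_{y^n} \ln \frac{\sup_{f \in \mathcal{F}_k} f_n(y^n)}{g_n^{(k)}(y^n)} \le \max_{i=1,\ldots,N} \sup_{y^n} \ln \frac{\sup_{f \in \mathcal{F}_i} f_n(y^n)}{g_n^{(i)}(y^n)} = \max_{i=1,\ldots,N} V_n(\mathcal{F}_i) .
\]

Hence, we get

\[
V_n(\mathcal{F}) \le \ln N(\mathcal{F}, \varepsilon) + \max_{i=1,\ldots,N} V_n(\mathcal{F}_i) .
\]

Now note that the diameter of each element of the partition $\mathcal{F}_1, \ldots, \mathcal{F}_N$ is at most $2\varepsilon$. Hence,
applying Lemma 9.4 to each $\mathcal{F}_i$, we find that

\[
\begin{aligned}
V_n(\mathcal{F}) &\le \ln N(\mathcal{F}, \varepsilon) + 24 \max_{i=1,\ldots,N} \int_0^{\varepsilon} \sqrt{\ln N(\mathcal{F}_i, \delta)} \, d\delta \\
&\le \ln N(\mathcal{F}, \varepsilon) + 24 \int_0^{\varepsilon} \sqrt{\ln N(\mathcal{F}, \delta)} \, d\delta,
\end{aligned}
\]

which concludes the proof. ∎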


Theorem 9.8 requires the existence of finite coverings of F in the metric d. But since 
the definition of d involves the logarithm of the experts’ predictions, this is only possible 
if all $f_t(j \mid y^{t-1})$ are bounded away from 0. If the class of experts $\mathcal{F}$ does not satisfy this
property, one may appeal to the following simple property. For simplicity, we state the 
lemma for the binary-alphabet case m = 2. Its extension to m > 2 is obvious. 


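Lemma 9.5. Let $m = 2$, and let $\mathcal{F}$ be a class of experts. Define the class $\mathcal{F}^{(\delta)}$ as the set of
all experts $f^{(\delta)}$ of the form

\[
f_t^{(\delta)}(1 \mid y^{t-1}) = \tau_\delta\left( f_t(1 \mid y^{t-1}) \right),
\]

where

\[
\tau_\delta(x) =
\begin{cases}
\delta & \text{if } x < \delta \\
x & \text{if } x \in [\delta, 1 - \delta] \\
1 - \delta & \text{if } x > 1 - \delta
\end{cases}
\]

for some fixed $0 < \delta < 1/2$. Then $V_n(\mathcal{F}) \le V_n(\mathcal{F}^{(\delta)}) + 2n\delta$.

Thus, to obtain bounds for $V_n(\mathcal{F})$, we may first calculate a bound for the truncated class
$\mathcal{F}^{(\delta)}$ using Theorem 9.8 and then choose $\delta$ to optimize the right-hand side of the inequality
of Lemma 9.5. The bound of the lemma is convenient but quite crude, and the resulting
bounds are not always optimal.

Proof. Simply observe that for any sequence $y^n$, and any $f \in \mathcal{F}$,

\[
\begin{aligned}
\ln f_n(y^n) &= \sum_{t=1}^{n} \ln f_t(y_t \mid y^{t-1}) \\
&\le \sum_{t : f_t^{(\delta)}(y_t \mid y^{t-1}) < 1 - \delta} \ln f_t^{(\delta)}(y_t \mid y^{t-1}) + \sum_{t : f_t^{(\delta)}(y_t \mid y^{t-1}) = 1 - \delta} \ln 1 \\
&\le \sum_{t : f_t^{(\delta)}(y_t \mid y^{t-1}) < 1 - \delta} \ln f_t^{(\delta)}(y_t \mid y^{t-1}) + \sum_{t : f_t^{(\delta)}(y_t \mid y^{t-1}) = 1 - \delta} \left( \ln(1 - \delta) + 2\delta \right) \\
&\qquad \text{(since } \ln 1 \le \ln(1 - \delta) + \delta/(1 - \delta) \text{ by concavity, and using } 0 < \delta < 1/2\text{)} \\
&\le \sum_{t=1}^{n} \left( \ln f_t^{(\delta)}(y_t \mid y^{t-1}) + 2\delta \right) \\
&= \ln f_n^{(\delta)}(y^n) + 2n\delta .
\end{aligned}
\]

Thus, for any forecaster $\hat{p}$,

\[
\begin{aligned}
\hat{L}(y^n) - \inf_{f \in \mathcal{F}} L_f(y^n) &= -\ln \hat{p}_n(y^n) + \sup_{f \in \mathcal{F}} \ln f_n(y^n) \\
&\le -\ln \hat{p}_n(y^n) + \sup_{f^{(\delta)} \in \mathcal{F}^{(\delta)}} \ln f_n^{(\delta)}(y^n) + 2n\delta \\
&= \hat{L}(y^n) - \inf_{f^{(\delta)} \in \mathcal{F}^{(\delta)}} L_{f^{(\delta)}}(y^n) + 2n\delta . \qquad ∎
\end{aligned}
\]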
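
The key inequality in the proof of Lemma 9.5, $\ln f_n(y^n) \le \ln f_n^{(\delta)}(y^n) + 2n\delta$, is easy to test empirically. In the sketch below (function names and the randomly generated experts are illustrative assumptions) an expert is represented by the list of probabilities it assigns to outcome 1, including the extreme values 0 and 1 that truncation is meant to remove.

```python
import math
import random

def truncate(x, delta):
    """The map tau_delta: clip a probability to the interval [delta, 1 - delta]."""
    return min(max(x, delta), 1.0 - delta)

def log_probability(probs_of_one, y):
    """ln-probability of the binary sequence y (symbols 1 and 2) under an
    expert that assigns probs_of_one[t] to outcome 1 at round t."""
    total = 0.0
    for p_one, symbol in zip(probs_of_one, y):
        p = p_one if symbol == 1 else 1.0 - p_one
        if p == 0.0:
            return -math.inf          # the expert ruled this outcome out
        total += math.log(p)
    return total

def truncation_gap(probs_of_one, y, delta):
    """ln f_n(y^n) - ln f_n^(delta)(y^n); Lemma 9.5 bounds it by 2 n delta."""
    truncated = [truncate(p, delta) for p in probs_of_one]
    return log_probability(probs_of_one, y) - log_probability(truncated, y)

# A case where truncation hurts: the expert was certain and correct twice.
example_gap = truncation_gap([1.0, 0.0, 0.3], (1, 2, 1), 0.1)
```

In this example the gap is $-2 \ln 0.9 \approx 0.21$, well below $2n\delta = 0.6$, which illustrates why the lemma is described as convenient but crude.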


9.11 Further Examples 


Parametric Classes 
Consider first classes F such that there exist positive constants k and c satisfying, for 
alle > 0, 
c/n 
InN(F, £) < kin we 
E 
This is the case for most “parametric” classes, that is, classes that can be parameterized by 


a bounded subset of R* in some “smooth” way provided that all experts’ predictions are 
bounded away from 0. 


Corollary 9.1. Assume that the covering numbers of the class F satisfy the inequality 
above. Then 


k 
ViA(F) < 5 Inn + o(Inn). 


The main term f Inn is the same as the one we have seen in the case of the class of all 
constant and Markov experts. In those cases we could derive much sharper expressions for 
V, (F) even without requiring that the experts’ predictions be bounded away from 0. On 
the other hand, this corollary allows us to handle, in a simple way, much more complicated 
classes of experts. An example is provided in Exercise 9.18. 
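Proof of Corollary 9.1. Substituting the condition on covering numbers in the upper bound
of Theorem 9.8, the first term of the expression is bounded by
$\frac{k}{2} \ln n + k \ln c - k \ln \varepsilon$. Then the second term may be bounded as follows:

$$
\begin{aligned}
24 \int_0^{\varepsilon} \sqrt{\ln N(\mathcal{F}, \delta)}\, d\delta
&\le 48 c \sqrt{kn} \int_{a_\varepsilon}^{\infty} x^2 e^{-x^2}\, dx \\
&\qquad \text{(by substituting } x = \sqrt{\ln(c\sqrt{n}/\delta)} \text{ and writing } a_\varepsilon = \sqrt{\ln(c\sqrt{n}/\varepsilon)}\text{)} \\
&= 48 c \sqrt{kn} \left[ \frac{a_\varepsilon \varepsilon}{2c\sqrt{n}} + \frac{1}{2} \int_{a_\varepsilon}^{\infty} e^{-x^2}\, dx \right]
\qquad \text{(by integrating by parts)} \\
&\le 48 c \sqrt{kn} \left[ \frac{a_\varepsilon \varepsilon}{2c\sqrt{n}} + \frac{\varepsilon}{4 a_\varepsilon c\sqrt{n}} \right]
\qquad \text{(by using the gaussian tail estimate } \textstyle\int_t^{\infty} e^{-x^2} dx \le e^{-t^2}/(2t)\text{)} \\
&\le 48 \sqrt{k}\, a_\varepsilon \varepsilon
\qquad \text{(whenever } e\varepsilon \le c\sqrt{n}\text{)} \\
&\le 48 \sqrt{2}\, \varepsilon \sqrt{k \ln(c\sqrt{n})}
\qquad \text{(whenever } \varepsilon \ge 1/(c\sqrt{n})\text{).}
\end{aligned}
$$

The obtained upper bound is minimized for

$$
\varepsilon = \frac{1}{48\sqrt{2}} \sqrt{\frac{k}{\ln(c\sqrt{n})}}.
$$

So, for every $n$ so large that

$$
c\sqrt{n} \ge 48\sqrt{2} \sqrt{\frac{\ln(c\sqrt{n})}{k}},
$$

we have

$$
V_n(\mathcal{F}) \le \frac{k}{2} \ln n + \frac{k}{2} \ln \frac{\ln(c\sqrt{n})}{k} + k \ln c + 6k,
$$

which proves the statement. ∎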
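The optimization step at the end of the proof can be checked numerically. The sketch below assumes the two-term bound $k \ln(c\sqrt{n}/\varepsilon) + 48\sqrt{2}\,\varepsilon\sqrt{k \ln(c\sqrt{n})}$ and the minimizing radius $\varepsilon = \frac{1}{48\sqrt{2}}\sqrt{k/\ln(c\sqrt{n})}$; the function name `corollary_bound` and the sample values of `n`, `k`, `c` are illustrative choices, not part of the text.

```python
import math


def corollary_bound(n: int, k: float, c: float) -> tuple[float, float]:
    """Return (optimized two-term bound, closed-form bound with the 6k slack)."""
    log_cn = math.log(c * math.sqrt(n))
    # Radius that minimizes k*ln(c*sqrt(n)/eps) + 48*sqrt(2)*eps*sqrt(k*ln(c*sqrt(n))).
    eps = math.sqrt(k / log_cn) / (48 * math.sqrt(2))
    optimized = (k * math.log(c * math.sqrt(n) / eps)
                 + 48 * math.sqrt(2) * eps * math.sqrt(k * log_cn))
    closed_form = (0.5 * k * math.log(n)
                   + 0.5 * k * math.log(log_cn / k)
                   + k * math.log(c)
                   + 6 * k)
    return optimized, closed_form


optimized, closed_form = corollary_bound(n=10**6, k=3.0, c=2.0)
print(optimized <= closed_form)
```

At the minimizing radius the second term equals exactly $k$, so the gap between the two returned values is $k\,(5 - \ln(48\sqrt{2}))$, which is where the constant $6k$ has room to spare.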


Nonparametric Classes 

Theorem 9.8 may also be used to handle much larger, “nonparametric” classes. In such
cases the minimax regret $V_n(\mathcal{F})$ may be of a significantly larger order of magnitude than the
logarithmic bounds characteristic of parametric classes. We work out one simple example
here.

Let $\mathcal{Y} = \{1, 2\}$ be a binary alphabet, and consider the class $\mathcal{F}^{(\delta)}$ of all experts $f$ such that
$f_t(1 \mid y^{t-1}) = f_t(1) \in [\delta, 1 - \delta]$, where $\delta \in (0, 1/2)$ is some fixed constant, and for each
$t = 2, 3, \ldots, n$, $f_t(1) \ge f_{t-1}(1)$. (The case when $\delta = 0$ may be treated by Lemma 9.5.)
In other words, $\mathcal{F}^{(\delta)}$ contains all static experts that assign a probability to outcome 1 in a
monotonically increasing manner. To estimate the covering numbers of $\mathcal{F}^{(\delta)}$, consider the
finite subclass $\mathcal{G}$ of $\mathcal{F}^{(\delta)}$ containing only those monotone experts $g$ that take values of the
form $g_t(1) = \delta + (i/k)(1 - 2\delta)$, $i = 0, \ldots, k$, where $k$ is a positive integer to be specified
later. It is easy to see that $|\mathcal{G}| = \binom{n+k}{k} \le (2n)^k$ if $k \le n$, and $|\mathcal{G}| \le 2^{2k}$ otherwise. On the
other hand, for any $f \in \mathcal{F}^{(\delta)}$, if $g$ is the expert in $\mathcal{G}$ closest to $f$, then for each $t \le n$,

$$
\begin{aligned}
\max_{y \in \{1,2\}} \bigl| \ln f_t(y) - \ln g_t(y) \bigr|
&\le \frac{1}{\delta} \max_{y \in \{1,2\}} \bigl| f_t(y) - g_t(y) \bigr| \\
&= \frac{1}{\delta} \bigl| f_t(1) - g_t(1) \bigr| \\
&\le \frac{1}{\delta k}.
\end{aligned}
$$

Thus, $d(f, g) \le \sqrt{n}/(\delta k)$. By taking $k = \sqrt{n}/(\delta\varepsilon)$, it follows that the covering number of
$\mathcal{F}^{(\delta)}$ is bounded as

$$
N(\mathcal{F}^{(\delta)}, \varepsilon) \le
\begin{cases}
(2n)^{\sqrt{n}/(\delta\varepsilon)} & \text{if } \varepsilon \ge \dfrac{1}{\delta\sqrt{n}}, \\[1ex]
2^{2\sqrt{n}/(\delta\varepsilon)} & \text{otherwise.}
\end{cases}
$$

Substituting this bound into Theorem 9.8, it is a matter of straightforward calculation to
obtain

$$
V_n(\mathcal{F}^{(\delta)}) = O\bigl(n^{1/3} \delta^{-2/3} \ln^{2/3} n\bigr).
$$

Note that the radius optimizing the bound of Theorem 9.8 is $\varepsilon \approx n^{1/6} \delta^{-1/3} \ln^{1/3} n$. Finally,
if $\mathcal{F}$ is the class of all monotonically predicting experts, without restricting
predictions in $[\delta, 1 - \delta]$, then by Lemma 9.5, $V_n(\mathcal{F}) \le V_n(\mathcal{F}^{(\delta)}) + 2n\delta$. By optimizing the
value of $\delta$ in the upper bound obtained, we get $V_n(\mathcal{F}) = O\bigl(n^{3/5} \ln^{2/5} n\bigr)$.


9.12 Bibliographic Remarks 


The literature about predicting “individual sequences” under the logarithmic loss function 
has been intimately tied with a closely related “probabilistic” setup in which one assumes 
that the sequence of outcomes is generated randomly by one of the distributions in the 
class of experts. Even though in this chapter we do not consider the probabilistic setup 
at all, often one cannot separate the literature on the two problems, sometimes commonly 
known as the problem of universal prediction. The related literature is huge, and here we 
only mention a small selection of references. A survey summarizing a large body of the 
literature on prediction under the logarithmic loss is offered by Merhav and Feder [214]. 
The tight connection of sequential probability assignment and universal (lossless) source 
coding goes back to Kolmogorov [185] and Solomonoff [274, 275]. Fitingof [99, 100] and 
Davisson [79] were also among the pioneers of the field. 

The connection of sequential probability assignment and data compression with arithmetic
coding was first revealed by Rissanen [236] and Rissanen and Langdon [243]. One
of the most successful sequential coding methods, also applicable to prediction, is the
Lempel–Ziv algorithm (see [197,319] and also Feder, Merhav, and Gutman [95]).

The equivalence of sequential gambling and forecasting under the logarithmic loss 
function was noted by Kelly [180]; see also Cover [69] and Feder [94]. 

De Santis, Markowski, and Wegman [260] consider the logarithmic loss in the context
of online learning. Theorem 9.1 is due to Shtarkov [267], as is Theorem 9.2; see also
Freund [111] and Xie and Barron [312]. The Laplace mixture forecaster was introduced, in
the context of universal coding, by Davisson [79], and also investigated by Rissanen [239].
The refined mixture forecaster presented in Section 9.7 was suggested by Krichevsky and
Trofimov [186]. Lemma 9.3 appears in Willems, Shtarkov, and Tjalkens [311]. It is shown
in Xie and Barron [312] and Freund [111] that the Krichevsky–Trofimov mixture, in fact,
achieves a regret $\frac{1}{2} \ln n + \frac{1}{2} \ln \frac{\pi}{2} + o(1)$. This is optimal even in the additive constant for all
sequences except for those containing very few 1’s or 2’s. Xie and Barron [312] refine the
mixture further so that it achieves a worst-case cumulative regret of $\frac{1}{2} \ln n + \frac{1}{2} \ln \frac{\pi}{2} + o(1)$,
matching the performance of the minimax optimal forecaster. Xie and Barron [312] also
derive the analog of all these results in the general case $m > 2$. Theorem 9.5 also appears
in [312], where the case of an $m$-ary alphabet is also treated and the asymptotic constant is
determined. Szpankowski [282] develops analytical tools to determine $V_n(\mathcal{F})$ to arbitrary
precision for the class of constant experts; see also Drmota and Szpankowski [90].

The material of Section 9.6 is based on the work of Weinberger, Merhav, and Feder [307], 
who prove a similar result in a significantly more general setup. In [307] an analogous lower
bound is shown for classes of experts defined by a finite-state machine with a strongly 
connected state transition graph. The work of Weinberger, Merhav, and Feder was inspired 
by a similar result of Rissanen [238] in the model of probabilistic prediction. 

Lower bounds for the minimax regret under general metric entropy assumptions may be 
obtained by noting that lower bounds for the probabilistic counterpart work in the setup of 
individual sequences as well. We mention the important work of Haussler and Opper [152]. 

Theorem 9.8 is due to Cesa-Bianchi and Lugosi [53], who improve an earlier result 
of Opper and Haussler [228] for classes of static experts. A general expression for the 
minimax regret, not described in this chapter, for certain regular parametric classes has 
been derived by Rissanen [242]. More specifically, Rissanen considers classes $\mathcal{F}$ of experts
$f_{n,\theta}$ parameterized by an open and bounded set of parameters $\Theta \subset \mathbb{R}^k$. It is shown in [242]
that under certain regularity assumptions,

$$
V_n(\mathcal{F}) = \frac{k}{2} \ln \frac{n}{2\pi} + \ln \int_{\Theta} \sqrt{\det(I(\theta))}\, d\theta + o(1),
$$

where the $k \times k$ matrix $I(\theta)$ is the so-called Fisher information matrix, whose entry in
position $(i, j)$ is defined by


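$$
-\frac{1}{n} \sum_{y^n \in \mathcal{Y}^n} f_{n,\theta}(y^n)\, \frac{\partial^2 \ln f_{n,\theta}(y^n)}{\partial \theta_i\, \partial \theta_j},
$$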


where $\theta_i$ is the $i$th component of vector $\theta$. Yamanishi [313] generalizes Rissanen’s results
to a wider class of loss functions. The expressions of the minimax regret for the class of 
Markov experts were determined by Rissanen [242] and Jacquet and Szpankowski [168]. 

Finally, we mention that the problem of prediction under the logarithmic loss has 
applications in the study of the general principle of minimum description length (MDL), first 
proposed by Rissanen [237, 238, 241]. For quite exhaustive surveys see Barron, Rissanen,
and Yu [22], Grünwald [134], and Hansen and Yu [143].


9.13 Exercises 


9.1 Let $\mathcal{F}$ be the class of all experts such that for each $f \in \mathcal{F}$, $f_t(j \mid y^{t-1}) = f(j)$ (with $f(j) \ge 0$,
$\sum_{j=1}^{m} f(j) = 1$) independently of $t$ and $y^{t-1}$. For a particular sequence $y^n \in \mathcal{Y}^n$, determine the
best expert and its cumulative loss.


9.2 Assume that you want to bet in the horse race only once and that you know that the $j$th horse
wins with probability $p_j$ and the odds are $o_1, \ldots, o_m$. How would you distribute your money to
maximize your expected winnings? Contrast your result with the setup described in Section 9.3,
where the optimal betting strategy is independent of the odds.

9.3 Show that there exists a class $\mathcal{F}$ of experts with cardinality $|\mathcal{F}| = N$ such that for all $n \ge \log_2 N$,
$V_n(\mathcal{F}) = \ln N$. This exercise shows that the bound $\ln N$ achieved by the uniform mixture
forecaster is not improvable for some classes.

9.4 Consider the class $\mathcal{F}$ of all constant experts. Show that the normalized maximum likelihood
forecaster $p_n^*$ is horizon dependent in the sense that if $p_t^*$ denotes the normalized maximum likelihood
forecaster for some $t < n$ (i.e., $p_t^*$ achieves the minimax regret $V_t(\mathcal{F})$), then it is not true that

$$
\sum_{y_{t+1}^n \in \mathcal{Y}^{n-t}} p_n^*(y^n) = p_t^*(y^t).
$$

9.5 Let $\mathcal{F}$ be a class of experts and let $q$ and $\hat{p}$ be arbitrary forecasters (i.e., probability distributions
over $\mathcal{Y}^n$). Show that

$$
\sum_{y^n \in \mathcal{Y}^n} q(y^n) \ln \frac{\sup_{f \in \mathcal{F}} f_n(y^n)}{\hat{p}_n(y^n)}
\ge \sum_{y^n \in \mathcal{Y}^n} q(y^n) \ln \frac{\sup_{f \in \mathcal{F}} f_n(y^n)}{q(y^n)}
$$

and that

$$
\sum_{y^n \in \mathcal{Y}^n} q(y^n) \ln \frac{\sup_{f \in \mathcal{F}} f_n(y^n)}{q(y^n)} = V_n(\mathcal{F}) - D(q \,\|\, p_n^*),
$$

where $p^*$ is the normalized maximum likelihood forecaster and $D$ denotes Kullback–Leibler
divergence.

9.6 Show that

$$
\int_0^1 \frac{1}{\sqrt{x(1 - x)}}\, dx = \pi.
$$

Hint: Substitute $x$ by $\sin^2 \alpha$.

9.7 Show that for every $n > 1$ there exists a class $\mathcal{F}_n$ of two static experts such that if $\hat{p}$ denotes the
exponentially weighted average (or mixture) forecaster, then

$$
\frac{V_n(\hat{p}, \mathcal{F}_n)}{V_n(\mathcal{F}_n)} \ge c\sqrt{n}
$$

for some universal constant $c$ (see Cesa-Bianchi and Lugosi [53]). Hint: Let $\mathcal{Y} = \{0, 1\}$ and let
$\mathcal{F}_n$ contain the two experts $f, g$ defined by $f(1 \mid y^{t-1}) = \frac{1}{2}$ and $g(1 \mid y^{t-1}) = \frac{1}{2} + \frac{1}{2n}$. Show, on
the one hand, that $V_n(\mathcal{F}_n) \le c_1 n^{-1/2}$ and on the other hand that $V_n(\hat{p}, \mathcal{F}_n) \ge c_2$ for appropriate
constants $c_1, c_2$.

9.8 Extend Theorem 9.2 to the case when the outcome space is $\mathcal{Y} = \{1, \ldots, m\}$. More precisely,
show that the minimax regret of the class of constant experts is

$$
V_n(\mathcal{F}) = \frac{m-1}{2} \ln \frac{n}{2\pi} + \ln \frac{\Gamma(1/2)^m}{\Gamma(m/2)} + o(1)
\approx \frac{m-1}{2} \ln \frac{ne}{m} + o(1)
$$

(Xie and Barron [312]).

9.9 Complete the proof of Theorem 9.2 by showing that $V_n(\mathcal{F}) \ge \frac{1}{2} \ln n + \frac{1}{2} \ln \frac{\pi}{2} + o(1)$. Hint: The
proof goes the same way as that of the upper bound, but to get the right constant you need to be
a bit careful when $n_1/n$ or $n_2/n$ is small.
9.10 Define the Laplace forecaster for the class of all constant experts over the alphabet $\mathcal{Y} =
\{1, 2, \ldots, m\}$ for $m > 2$. Extend the arguments of Section 9.6 to this general case. In particular,
show that the worst-case cumulative regret of the uniformly weighted mixture forecaster
satisfies

$$
\sup_{y^n \in \mathcal{Y}^n} \Bigl( \widehat{L}(y^n) - \inf_{f \in \mathcal{F}} L_f(y^n) \Bigr) = \ln \binom{n + m - 1}{m - 1} \le (m - 1) \ln(n + 1).
$$

9.11 Prove Lemma 9.3. Hint: Show that the ratio of the two sides decreases if we replace $n$ by $n + 1$
(and also increase either $n_1$ or $n_2$ by 1) and thus achieves its minimum when $n = 1$. Warning:
This requires some work.

9.12 Show that the function defined by (9.1) is decreasing in both of its variables. Hint: Proceed by
induction.

9.13 Show that the minimax regret $V_n(\mathcal{M}_k)$ of the class of all $k$th-order Markov experts over a binary
alphabet $\mathcal{Y} = \{1, 2\}$ satisfies

$$
V_n(\mathcal{M}_k) = \frac{2^k}{2} \ln \frac{n}{2^k} + O(1)
$$

(Rissanen [242]).

9.14 Generalize Theorem 9.6 to the case when $\mathcal{Y} = \{1, \ldots, m\}$, with $m > 2$, and $\mathcal{F}$ is the class of
all constant experts.

9.15 Generalize Theorem 9.6 to the case when $\mathcal{Y} = \{1, 2\}$ and $\mathcal{F} = \mathcal{M}_k$ is the class of all $k$th-order
Markov experts. Hint: You may need to redefine classes $T_j$ as classes of “Markov types”
adequately. Counting the cardinality of these classes is not as trivial as in the case of constant
experts. (See Weinberger, Merhav, and Feder [307] for a more general result.)

9.16 (Double mixture for Markov experts) Assume that $\mathcal{Y} = \{1, 2\}$. Construct a forecaster $\hat{p}$ such
that for any $k = 1, 2, \ldots$ and any $k$th-order Markov expert $f \in \mathcal{M}_k$,

$$
\sup_{y^n \in \mathcal{Y}^n} \bigl( \widehat{L}(y^n) - L_f(y^n) \bigr) \le \frac{2^k}{2} \ln \frac{n}{2^k} + \alpha_{k,n},
$$

where for each $k$, $\limsup_{n \to \infty} \alpha_{k,n} \le \beta_k < \infty$. Hint: For each $k$ consider the forecaster described
in Example 9.1. Then combine them by the countable mixture described in Section 9.2 (see also
Ryabko [252, 253]).

9.17 (Predicting as well as the best finite-state machine) Consider $\mathcal{Y} = \{1, 2\}$. A $k$-state finite-state
machine forecaster is defined as a triple $(S, F, G)$ where $S$ is a finite set of $k$ elements,
$F : S \to [0, 1]$ is the output function, and $G : \mathcal{Y} \times S \to S$ is the next-state function. For a
sequence of outcomes $y_1, y_2, \ldots$, the finite-state forecaster produces a prediction given by the
recursions

$$
s_t = G(s_{t-1}, y_{t-1}) \quad \text{and} \quad f_t(1 \mid y^{t-1}) = F(s_t)
$$

for $t = 2, 3, \ldots$ while $f_1(1) = F(s_1)$ for an initial state $s_1 \in S$. Construct a forecaster that
predicts almost as well as any finite-state machine forecaster in the sense that

$$
\sup_{y^n \in \mathcal{Y}^n} \bigl( \widehat{L}(y^n) - L_f(y^n) \bigr) = O(\ln n)
$$

for every finite-state machine forecaster $f$. Hint: Use the previous exercise. (See also Feder,
Merhav, and Gutman [95].)
9.18 (Experts with fading memory) Let $\mathcal{Y} = \{0, 1\}$ and consider the one-parameter class $\mathcal{F}$ of
distributions on $\{0, 1\}^n$ containing all experts $f^{(a)}$, with $a \in [0, 1]$, where each $f^{(a)}$ is defined
by its conditionals as $f_1^{(a)}(1) = 1/2$, $f_2^{(a)}(1 \mid y_1) = y_1$, and

$$
f_t^{(a)}(1 \mid y^{t-1}) = \frac{1}{t-1} \sum_{s=1}^{t-1} y_s \left( 1 + \frac{a(2s - t)}{t - 2} \right)
$$

for all $y^{t-1} \in \{0, 1\}^{t-1}$ and for all $t > 2$. Show that

$$
V_n(\mathcal{F}) = O(\ln n).
$$

Hint: First show using Theorem 9.8 that $V_n(\mathcal{F}^{(\delta)}) \le \frac{1}{2} \ln n + \frac{1}{2} \ln \ln \frac{\sqrt{n}}{\delta} + \ln \frac{1}{\delta} + O(1)$ and then
use Lemma 9.5.
Sequential Investment


10.1 Portfolio Selection 


This chapter is devoted to the application of the ideas described in Chapter 9 to the problem 
of sequential investment. Imagine a market of m assets (stocks) in which, in each trading 
period (day), the price of a stock may vary in an arbitrary way. An investor operates on this 
market for n days with the goal of maximizing his final wealth. At the beginning of each 
day, on the basis of the past behavior of the market, the investor redistributes his current 
wealth among the m assets. Following the approach developed in the previous chapters, 
we avoid any statistical assumptions about the nature of the stock market, and evaluate 
the investor’s wealth relative to the performance achieved by the best strategy in a class of 
reference investment strategies (the “experts”).

In the idealized stock market we assume that there are no transaction costs and the 
amount of each stock that can be bought at any trading period is only limited by the 
investor’s wealth at that time. Similarly, the investor can sell any quantity of the stocks he 
possesses at any time at the actual market price. 

The model may be formalized as follows. A market vector $\mathbf{x} = (x_1, \ldots, x_m)$ for $m$ assets
is a vector of nonnegative real numbers representing price relatives for a given trading
period. In other words, the quantity $x_i \ge 0$ denotes the ratio of closing to opening price
of the $i$th asset for that period. Hence, an initial wealth invested in the $m$ assets according
to fractions $Q_1, \ldots, Q_m$ multiplies by a factor of $\sum_{i=1}^{m} x_i Q_i$ at the end of the period. The
market behavior during $n$ trading periods is represented by a sequence of market vectors
$\mathbf{x}^n = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$. The $j$th component of $\mathbf{x}_t$, denoted by $x_{j,t}$, is the factor by which the
wealth invested in asset $j$ increases in the $t$th period.

As in Chapter 9, we denote the probability simplex in $\mathbb{R}^m$ by $\mathcal{D}$. An investment strategy $\mathbf{Q}$
for $n$ trading periods is a sequence $\mathbf{Q}_1, \ldots, \mathbf{Q}_n$ of vector-valued functions $\mathbf{Q}_t : (\mathbb{R}_+^m)^{t-1} \to \mathcal{D}$,
where the $i$th component $Q_{i,t}(\mathbf{x}^{t-1})$ of the vector $\mathbf{Q}_t(\mathbf{x}^{t-1})$ denotes the fraction of the current
wealth invested in the $i$th asset at the beginning of the $t$th period on the basis of the past
market behavior $\mathbf{x}^{t-1}$. We use

$$
S_n(\mathbf{Q}, \mathbf{x}^n) = \prod_{t=1}^{n} \left( \sum_{i=1}^{m} x_{i,t}\, Q_{i,t}(\mathbf{x}^{t-1}) \right)
$$

to denote the wealth factor of strategy $\mathbf{Q}$ after $n$ trading periods. The fact that $\mathbf{Q}_t$ has
nonnegative components summing to 1 expresses the condition that short sales and buying
on margin are excluded.
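The wealth factor is simply a running product of per-period portfolio returns. A minimal sketch follows, assuming a strategy is represented as a function from the list of past market vectors to a portfolio vector in the simplex; the names `wealth_factor` and `uniform` are illustrative, not notation from the text.

```python
from typing import Callable, Sequence

Vector = Sequence[float]
# A strategy maps the past market vectors x^{t-1} to a portfolio in the simplex.
Strategy = Callable[[Sequence[Vector]], Vector]


def wealth_factor(strategy: Strategy, market: Sequence[Vector]) -> float:
    """Product over periods of the inner product between portfolio and market vector."""
    wealth = 1.0
    for t, x_t in enumerate(market):
        q_t = strategy(market[:t])  # may depend only on the past
        wealth *= sum(q * x for q, x in zip(q_t, x_t))
    return wealth


def uniform(past: Sequence[Vector]) -> Vector:
    """Hypothetical strategy: always split wealth equally between two assets."""
    return (0.5, 0.5)


print(wealth_factor(uniform, [(1.0, 0.5), (1.0, 2.0)]))
```

Passing only `market[:t]` to the strategy enforces the requirement that the allocation at period $t$ is chosen before the $t$th market vector is revealed.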
Example 10.1 (Buy-and-hold strategies). The simplest investment strategies are the so-called
buy-and-hold strategies. An investor following such a strategy simply distributes
his initial wealth among $m$ assets according to some distribution $\mathbf{Q}_1 \in \mathcal{D}$ before the first
trading period and does not trade anymore. The wealth factor of such a strategy, after $n$
periods, is simply

$$
S_n(\mathbf{Q}, \mathbf{x}^n) = \sum_{j=1}^{m} Q_{j,1} \prod_{t=1}^{n} x_{j,t}.
$$

Clearly, this wealth factor is at most as large as the gain $\max_{j=1,\ldots,m} \prod_{t=1}^{n} x_{j,t}$ of the best
stock over the same investment period and achieves this maximal wealth if $\mathbf{Q}_1$ concentrates
on this best stock.


Example 10.2 (Constantly rebalanced portfolios). Another simple and important class of
investment strategies is the class of constantly rebalanced portfolios. Such a strategy $\mathbf{B}$
is parameterized by a probability vector $\mathbf{B} = (B_1, \ldots, B_m) \in \mathcal{D}$ and simply sets $\mathbf{Q}_t(\mathbf{x}^{t-1}) = \mathbf{B}$
regardless of $t$ and the past market behavior $\mathbf{x}^{t-1}$. Thus, an investor following such a strategy
rebalances, at every trading period, his current wealth according to the distribution $\mathbf{B}$ by
investing a proportion $B_1$ of this wealth in the first stock, a proportion $B_2$ in the second
stock, and so on. Observe that, as opposed to buy-and-hold strategies, an investor using a
constantly rebalanced portfolio $\mathbf{B}$ is engaged in active trading in each period. The wealth
factor achieved after $n$ trading periods is

$$
S_n(\mathbf{B}, \mathbf{x}^n) = \prod_{t=1}^{n} \left( \sum_{i=1}^{m} x_{i,t} B_i \right).
$$

To understand the power of constantly rebalanced strategies, consider a simple market of
$m = 2$ stocks such that the sequence of market vectors is $(1, \frac{1}{2}), (1, 2), (1, \frac{1}{2}), (1, 2), \ldots$.
Thus, the first stock maintains its value stable while the second stock is more volatile: on
even days it doubles its price, whereas on odd days it loses half of its value. Clearly, in the
long run, neither of the two stocks (and therefore no buy-and-hold strategy) yields any gain.
On the other hand, the investment strategy that rebalances every day uniformly (i.e., with
$\mathbf{B} = (\frac{1}{2}, \frac{1}{2})$) achieves an exponentially increasing wealth at a rate $(9/8)^{n/2}$. The importance
of constantly rebalanced portfolios is largely due to the fact that if the market vectors $\mathbf{x}_t$
are realizations of an i.i.d. process and the number $n$ of investment periods is large, then
the best possible investment strategy, in a quite strong sense, is a constantly rebalanced
portfolio (see Cover and Thomas [74] for a nice summary).
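The two-stock market of this example is easy to simulate. The sketch below compares the uniformly rebalanced portfolio with a buy-and-hold split of the same initial proportions over `n = 20` days; the helper names `crp_wealth` and `buy_and_hold_wealth` and the horizon are illustrative assumptions.

```python
def crp_wealth(b, market):
    """Wealth factor of the constantly rebalanced portfolio b."""
    wealth = 1.0
    for x in market:
        wealth *= sum(bi * xi for bi, xi in zip(b, x))
    return wealth


def buy_and_hold_wealth(q1, market):
    """Wealth factor when the initial split q1 is never rebalanced."""
    total = 0.0
    for j, qj in enumerate(q1):
        growth = 1.0
        for x in market:
            growth *= x[j]  # cumulative price relative of asset j
        total += qj * growth
    return total


n = 20
# Odd days: the second stock halves; even days: it doubles.
market = [(1.0, 0.5) if t % 2 == 1 else (1.0, 2.0) for t in range(1, n + 1)]
rebalanced = crp_wealth((0.5, 0.5), market)
held = buy_and_hold_wealth((0.5, 0.5), market)
print(rebalanced, held)
```

Each pair of days multiplies the rebalanced wealth by $(3/4)(3/2) = 9/8$, which gives $(9/8)^{n/2}$ after $n$ days, while the buy-and-hold wealth stays at 1.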


As for the other models of prediction considered in this book, the performance of any 
investment strategy is measured by comparing it to the best in a fixed class of strategies. 
To formalize this notion, we introduce the worst-case logarithmic wealth ratio in the next 
section. In Section 10.3 the main result of this chapter is presented, which points out a 
certain equivalence between the problem of sequential investment and prediction under the 
logarithmic loss studied in Chapter 9. This equivalence permits one to determine the limits 
of any investment strategy as well as to design strategies with near optimal performance 
guarantees. In particular, in Section 10.4 a strategy called “universal portfolio” is introduced 
that is shown to be an analog of the mixture forecasters of Chapter 9. The so-called EG 
investment strategy is presented in Section 10.5 whose aim is to relieve the computational 
burden of the universal portfolio. In Section 10.6 we allow the investor to take certain side
information into account and develop investment strategies in this extended framework. 


10.2 The Minimax Wealth Ratio 


The investor’s objective is to achieve a wealth comparable to the best of a certain class of
investment strategies regardless of the market behavior. Thus, given a class $\mathcal{Q}$ of investment
strategies, we define the worst-case logarithmic wealth ratio of strategy $\mathbf{P}$ by

$$
W_n(\mathbf{P}, \mathcal{Q}) = \sup_{\mathbf{x}^n} \sup_{\mathbf{Q} \in \mathcal{Q}} \ln \frac{S_n(\mathbf{Q}, \mathbf{x}^n)}{S_n(\mathbf{P}, \mathbf{x}^n)}.
$$

Clearly, the investor’s goal is to choose a strategy $\mathbf{P}$ for which $W_n(\mathbf{P}, \mathcal{Q})$ is as small
as possible. $W_n(\mathbf{P}, \mathcal{Q}) = o(n)$ means that the investment strategy $\mathbf{P}$ achieves the same
exponent of growth as the best reference strategy in class $\mathcal{Q}$ for all possible market behaviors.
The minimax logarithmic wealth ratio is just the best possible worst-case logarithmic wealth
ratio achievable by any investment strategy $\mathbf{P}$:

$$
W_n(\mathcal{Q}) = \inf_{\mathbf{P}} W_n(\mathbf{P}, \mathcal{Q}).
$$

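Example 10.3 (Finite classes). Assume that the investor competes against a finite class
$\mathcal{Q} = \{\mathbf{Q}^{(1)}, \ldots, \mathbf{Q}^{(N)}\}$ of investment strategies. A very simple strategy $\mathbf{P}$ divides the initial
wealth in $N$ equal parts and invests each part according to one of the “experts” $\mathbf{Q}^{(i)}$. Then the total
wealth of the strategy is

$$
S_n(\mathbf{P}, \mathbf{x}^n) = \frac{1}{N} \sum_{i=1}^{N} S_n(\mathbf{Q}^{(i)}, \mathbf{x}^n)
$$

and the worst-case logarithmic wealth ratio is bounded as

$$
W_n(\mathbf{P}, \mathcal{Q}) = \sup_{\mathbf{x}^n} \ln \frac{\max_{i=1,\ldots,N} S_n(\mathbf{Q}^{(i)}, \mathbf{x}^n)}{\frac{1}{N} \sum_{j=1}^{N} S_n(\mathbf{Q}^{(j)}, \mathbf{x}^n)}
\le \sup_{\mathbf{x}^n} \ln \frac{N \max_{i=1,\ldots,N} S_n(\mathbf{Q}^{(i)}, \mathbf{x}^n)}{\max_{j=1,\ldots,N} S_n(\mathbf{Q}^{(j)}, \mathbf{x}^n)} = \ln N. \qquad \square
$$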


10.3 Prediction and Investment 


In this section we point out an intimate connection between the sequential investment
problem and the problem of prediction under the logarithmic loss studied in Chapter 9.
The first thing we observe is that any investment strategy $\mathbf{Q}$ over $m$ assets may be used to
define a forecaster that predicts the elements $y_t \in \mathcal{Y} = \{1, \ldots, m\}$ of a sequence $y^n \in \mathcal{Y}^n$
with probability vectors $\hat{\mathbf{p}}_t \in \mathcal{D}$. To do this, we simply restrict our attention to those market
vectors $\mathbf{x}$ that have a single component that is equal to 1 and all other components equal to
0. Such vectors are called Kelly market vectors. Observe that a Kelly market is just like the
horse race described in Section 9.3, with the only restriction that all odds are supposed to
be equal to 1. If $\mathbf{x}_1, \ldots, \mathbf{x}_n$ are Kelly market vectors and we denote the index of the only
nonzero component of each vector $\mathbf{x}_t$ by $y_t$, then we may define forecaster $f$ by

$$
f_t(y \mid y^{t-1}) = Q_{y,t}(\mathbf{x}^{t-1}).
$$

We say that the forecaster $f$ is induced by the investment strategy $\mathbf{Q}$. With some abuse of
notation we write $S_n(\mathbf{Q}, y^n)$ for $S_n(\mathbf{Q}, \mathbf{x}^n)$ when $\mathbf{x}^n$ is a sequence of Kelly vectors determined
by the sequence $y^n$ of indices. Clearly, $S_n(\mathbf{Q}, y^n) = f_n(y^n)$ if $f$ is the forecaster induced
by $\mathbf{Q}$.
To relate the regret in forecasting to the logarithmic wealth ratio, we use the logarithmic
loss $\ell(\hat{\mathbf{p}}_t, y_t) = -\ln \hat{p}_t(y_t \mid y^{t-1})$. Then the regret against a reference forecaster $f$ is

$$
\widehat{L}_n - L_{f,n} = \ln \frac{f_n(y^n)}{\hat{p}_n(y^n)} = \ln \frac{S_n(\mathbf{Q}, y^n)}{S_n(\mathbf{P}, y^n)},
$$

where $\mathbf{Q}$ and $\mathbf{P}$ are the investment strategies inducing $f$ and $\hat{p}$ (see Section 9.1).

Now it is obvious that the investment problem is at least as difficult as the corresponding
prediction problem.


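Lemma 10.1. Let $\mathcal{Q}$ be a class of investment strategies, and let $\mathcal{F}$ denote the class of
forecasters induced by the strategies in $\mathcal{Q}$. Then the minimax regret

$$
V_n(\mathcal{F}) = \inf_{\hat{p}_n} \sup_{y^n} \sup_{f \in \mathcal{F}} \ln \frac{f_n(y^n)}{\hat{p}_n(y^n)}
$$

satisfies $W_n(\mathcal{Q}) \ge V_n(\mathcal{F})$.


Proof. Let $\mathbf{P}$ be any investment strategy and let $\hat{p}$ be its induced forecaster. Then

$$
\begin{aligned}
\sup_{\mathbf{x}^n} \sup_{\mathbf{Q} \in \mathcal{Q}} \ln \frac{S_n(\mathbf{Q}, \mathbf{x}^n)}{S_n(\mathbf{P}, \mathbf{x}^n)}
&\ge \max_{y^n \in \mathcal{Y}^n} \sup_{\mathbf{Q} \in \mathcal{Q}} \ln \frac{S_n(\mathbf{Q}, y^n)}{S_n(\mathbf{P}, y^n)} \\
&= \max_{y^n \in \mathcal{Y}^n} \sup_{f \in \mathcal{F}} \ln \frac{f_n(y^n)}{\hat{p}_n(y^n)} \\
&= V_n(\hat{p}, \mathcal{F}) \ge V_n(\mathcal{F}). \qquad \blacksquare
\end{aligned}
$$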


Surprisingly, as it turns out, the investment problem is not genuinely more difficult than that
of prediction. In what follows, we show that in many interesting cases the two problems
are, in fact, equivalent in a minimax sense.

Given a prediction strategy $\hat{p}$, we define an investment strategy $\mathbf{P}$ as follows:

$$
P_{j,t}(\mathbf{x}^{t-1}) = \frac{\sum_{y^{t-1} \in \mathcal{Y}^{t-1}} \hat{p}_t(y^{t-1} j) \left( \prod_{s=1}^{t-1} x_{y_s,s} \right)}{\sum_{y^{t-1} \in \mathcal{Y}^{t-1}} \hat{p}_{t-1}(y^{t-1}) \left( \prod_{s=1}^{t-1} x_{y_s,s} \right)}.
$$

Note that the factors $\prod_{s=1}^{t-1} x_{y_s,s}$ may be viewed as the return of the “extremal” investment
strategy that, on each trading period $t$, invests everything on the $y_t$th asset. Clearly, the
obtained investment strategy induces $\hat{p}$, and so we will say that $\hat{p}$ and $\mathbf{P}$ induce each other.

The following result is the key in relating the minimax wealth ratio to the minimax regret
of forecasters.
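Theorem 10.1. Let $\mathbf{P}$ be an investment strategy induced by a forecaster $\hat{p}$, and let $\mathcal{Q}$ be an
arbitrary class of investment strategies. Then for any market sequence $\mathbf{x}^n$,

$$
\sup_{\mathbf{Q} \in \mathcal{Q}} \ln \frac{S_n(\mathbf{Q}, \mathbf{x}^n)}{S_n(\mathbf{P}, \mathbf{x}^n)}
\le \max_{y^n \in \mathcal{Y}^n} \sup_{\mathbf{Q} \in \mathcal{Q}} \ln \frac{\prod_{t=1}^{n} Q_{y_t,t}(\mathbf{x}^{t-1})}{\hat{p}_n(y^n)}.
$$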

The proof of the theorem uses the following two simple lemmas. The first is an elementary 
inequality, whose proof is left as an exercise. 


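Lemma 10.2. Let $a_1, \ldots, a_n, b_1, \ldots, b_n$ be nonnegative numbers. Then

$$
\frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i} \le \max_{j=1,\ldots,n} \frac{a_j}{b_j},
$$

where we define $0/0 = 0$.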


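Lemma 10.3. The wealth factor achieved by an investment strategy $\mathbf{Q}$ may be written as

$$
S_n(\mathbf{Q}, \mathbf{x}^n) = \sum_{y^n \in \mathcal{Y}^n} \left( \prod_{t=1}^{n} x_{y_t,t} \right) \left( \prod_{t=1}^{n} Q_{y_t,t}(\mathbf{x}^{t-1}) \right).
$$

If the investment strategy $\mathbf{P}$ is induced by a forecaster $\hat{p}_n$, then

$$
S_n(\mathbf{P}, \mathbf{x}^n) = \sum_{y^n \in \mathcal{Y}^n} \left( \prod_{t=1}^{n} x_{y_t,t} \right) \hat{p}_n(y^n).
$$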


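Proof. First, we expand the product in the definition of $S_n(\mathbf{Q}, \mathbf{x}^n)$:

$$
\begin{aligned}
S_n(\mathbf{Q}, \mathbf{x}^n) &= \prod_{t=1}^{n} \left( \sum_{j=1}^{m} x_{j,t}\, Q_{j,t}(\mathbf{x}^{t-1}) \right) \\
&= \sum_{y^n \in \mathcal{Y}^n} \prod_{t=1}^{n} \bigl( x_{y_t,t}\, Q_{y_t,t}(\mathbf{x}^{t-1}) \bigr) \\
&= \sum_{y^n \in \mathcal{Y}^n} \left( \prod_{t=1}^{n} x_{y_t,t} \right) \left( \prod_{t=1}^{n} Q_{y_t,t}(\mathbf{x}^{t-1}) \right).
\end{aligned}
$$

On the other hand, if an investment strategy is induced by a forecaster $\hat{p}$, then

$$
\begin{aligned}
S_n(\mathbf{P}, \mathbf{x}^n) &= \prod_{t=1}^{n} \left( \sum_{j=1}^{m} x_{j,t}\, P_{j,t}(\mathbf{x}^{t-1}) \right) \\
&= \prod_{t=1}^{n} \frac{\sum_{y^t \in \mathcal{Y}^t} \hat{p}_t(y^t) \left( \prod_{s=1}^{t} x_{y_s,s} \right)}{\sum_{y^{t-1} \in \mathcal{Y}^{t-1}} \hat{p}_{t-1}(y^{t-1}) \left( \prod_{s=1}^{t-1} x_{y_s,s} \right)} \\
&= \sum_{y^n \in \mathcal{Y}^n} \left( \prod_{t=1}^{n} x_{y_t,t} \right) \hat{p}_n(y^n),
\end{aligned}
$$

where in the last equality we set $\sum_{y^{t-1} \in \mathcal{Y}^{t-1}} \left( \prod_{s=1}^{t-1} x_{y_s,s} \right) \hat{p}_{t-1}(y^{t-1}) = 1$ for $t = 1$. ∎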
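For small $m$ and $n$ the second identity of Lemma 10.3 can be verified by brute-force enumeration. The sketch below assumes the induced strategy is the ratio of two sums over prefixes $y^{t-1}$, each weighted by the extremal return $\prod_{s<t} x_{y_s,s}$, with joint probabilities $\hat{p}_t(y^{t-1} j)$ in the numerator and $\hat{p}_{t-1}(y^{t-1})$ in the denominator. Outcomes are indexed from 0, the forecaster is the add-one (Laplace) rule chosen purely for illustration, and all function names are hypothetical.

```python
import itertools
import math


def laplace_conditional(prefix, m=2):
    """Illustrative forecaster: add-one estimate of the next-symbol distribution."""
    return [(prefix.count(j) + 1) / (len(prefix) + m) for j in range(m)]


def joint_prob(cond, seq):
    """Joint probability of seq as a product of conditionals."""
    p = 1.0
    for t in range(len(seq)):
        p *= cond(seq[:t])[seq[t]]
    return p


def extremal_return(seq, market):
    """Return of always betting everything on asset seq[s] in period s."""
    return math.prod(market[s][y] for s, y in enumerate(seq))


def induced_portfolio(cond, past, m):
    """Portfolio of the induced strategy given the past market vectors."""
    prefixes = list(itertools.product(range(m), repeat=len(past)))
    denom = sum(joint_prob(cond, y) * extremal_return(y, past) for y in prefixes)
    return [sum(joint_prob(cond, y + (j,)) * extremal_return(y, past)
                for y in prefixes) / denom
            for j in range(m)]


def sequential_wealth(cond, market, m):
    """Wealth obtained by actually trading with the induced strategy."""
    wealth = 1.0
    for t, x in enumerate(market):
        p = induced_portfolio(cond, market[:t], m)
        wealth *= sum(pj * xj for pj, xj in zip(p, x))
    return wealth


def lemma_wealth(cond, market, m):
    """Right-hand side of the identity: sum over all outcome sequences."""
    n = len(market)
    return sum(extremal_return(y, market) * joint_prob(cond, y)
               for y in itertools.product(range(m), repeat=n))


market = [(1.1, 0.9), (0.8, 1.3), (1.2, 1.0), (0.95, 1.05)]
print(sequential_wealth(laplace_conditional, market, 2),
      lemma_wealth(laplace_conditional, market, 2))
```

The enumeration costs $m^n$ operations, so this is only a check of the identity on toy inputs, not a practical way to run the strategy. It also assumes strictly positive market vectors so that the denominator never vanishes.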


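Proof of Theorem 10.1. Fix any market sequence $\mathbf{x}^n$ and choose any reference strategy
$\mathbf{Q}' \in \mathcal{Q}$. To simplify notation, we write $S_n(y^n, \mathbf{x}^n) = \prod_{t=1}^{n} x_{y_t,t}$. Then using the expressions
derived above for $S_n(\mathbf{Q}', \mathbf{x}^n)$ and $S_n(\mathbf{P}, \mathbf{x}^n)$, we have

$$
\begin{aligned}
\frac{S_n(\mathbf{Q}', \mathbf{x}^n)}{S_n(\mathbf{P}, \mathbf{x}^n)}
&= \frac{\sum_{y^n \in \mathcal{Y}^n} S_n(y^n, \mathbf{x}^n) \prod_{t=1}^{n} Q'_{y_t,t}(\mathbf{x}^{t-1})}{\sum_{y^n \in \mathcal{Y}^n} S_n(y^n, \mathbf{x}^n)\, \hat{p}_n(y^n)} \\
&\le \max_{y^n :\, S_n(y^n, \mathbf{x}^n) > 0} \frac{S_n(y^n, \mathbf{x}^n) \prod_{t=1}^{n} Q'_{y_t,t}(\mathbf{x}^{t-1})}{S_n(y^n, \mathbf{x}^n)\, \hat{p}_n(y^n)}
\qquad \text{(by Lemma 10.2)} \\
&\le \max_{y^n \in \mathcal{Y}^n} \frac{\prod_{t=1}^{n} Q'_{y_t,t}(\mathbf{x}^{t-1})}{\hat{p}_n(y^n)} \\
&\le \max_{y^n \in \mathcal{Y}^n} \sup_{\mathbf{Q} \in \mathcal{Q}} \frac{\prod_{t=1}^{n} Q_{y_t,t}(\mathbf{x}^{t-1})}{\hat{p}_n(y^n)}. \qquad \blacksquare
\end{aligned}
$$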


An important and immediate corollary of Theorem 10.1 is that the minimax logarithmic
wealth ratio $W_n(\mathcal{Q})$ of any class $\mathcal{Q}$ of static strategies equals the minimax regret associated
with the class of the induced forecasters. To make this statement precise, we introduce
the notion of static investment strategies, similar to the notion of static experts in prediction
problems. A static investment strategy $\mathbf{Q}$ satisfies $\mathbf{Q}_t(\mathbf{x}^{t-1}) = \mathbf{Q}_t \in \mathcal{D}$ for each
$t = 1, \ldots, n$. Thus, the allocation of wealth $\mathbf{Q}_t$ for each trading period does not depend on
the past market behavior.


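Theorem 10.2. Let $\mathcal{Q}$ be a class of static investment strategies, and let $\mathcal{F}$ denote the class
of forecasters induced by strategies in $\mathcal{Q}$. Then

$$
W_n(\mathcal{Q}) = V_n(\mathcal{F}).
$$

Furthermore, the minimax optimal investment strategy is defined by

$$
P^*_{j,t}(\mathbf{x}^{t-1}) = \frac{\sum_{y^{t-1} \in \mathcal{Y}^{t-1}} p^*_t(y^{t-1} j) \left( \prod_{s=1}^{t-1} x_{y_s,s} \right)}{\sum_{y^{t-1} \in \mathcal{Y}^{t-1}} p^*_{t-1}(y^{t-1}) \left( \prod_{s=1}^{t-1} x_{y_s,s} \right)},
$$

where $p^*$ is the normalized maximum likelihood forecaster

$$
p^*_n(y^n) = \frac{\sup_{\mathbf{Q} \in \mathcal{Q}} \prod_{t=1}^{n} Q_{y_t,t}}{\sum_{y^n \in \mathcal{Y}^n} \sup_{\mathbf{Q} \in \mathcal{Q}} \prod_{t=1}^{n} Q_{y_t,t}}.
$$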
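Proof. By Lemma 10.1 we have $W_n(\mathcal{Q}) \ge V_n(\mathcal{F})$; so it suffices to prove that $W_n(\mathcal{Q}) \le
V_n(\mathcal{F})$. Recall from Theorem 9.1 that the normalized maximum likelihood forecaster $p^*$ is
minimax optimal for the class $\mathcal{F}$; that is,

$$
\max_{y^n \in \mathcal{Y}^n} \ln \frac{\sup_{\mathbf{Q} \in \mathcal{Q}} \prod_{t=1}^{n} Q_{y_t,t}}{p^*_n(y^n)} = V_n(\mathcal{F}).
$$

Now let $\mathbf{P}^*$ be the investment strategy induced by the minimax forecaster $p^*$ for $\mathcal{Q}$. By
Theorem 10.1 we get

$$
W_n(\mathcal{Q}) \le \sup_{\mathbf{x}^n} \sup_{\mathbf{Q} \in \mathcal{Q}} \ln \frac{S_n(\mathbf{Q}, \mathbf{x}^n)}{S_n(\mathbf{P}^*, \mathbf{x}^n)}
\le \max_{y^n \in \mathcal{Y}^n} \sup_{\mathbf{Q} \in \mathcal{Q}} \ln \frac{\prod_{t=1}^{n} Q_{y_t,t}}{p^*_n(y^n)} = V_n(\mathcal{F}).
$$

The fact that the worst-case logarithmic wealth ratio $\sup_{\mathbf{x}^n} \sup_{\mathbf{Q} \in \mathcal{Q}} \ln \bigl( S_n(\mathbf{Q}, \mathbf{x}^n)/S_n(\mathbf{P}^*, \mathbf{x}^n) \bigr)$ achieved by
the strategy $\mathbf{P}^*$ equals $W_n(\mathcal{Q})$ follows by the inequality above and the fact that $V_n(\mathcal{F}) \le
W_n(\mathcal{Q})$. ∎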


Example 10.4 (Constantly rebalanced portfolios). Consider now the class $\mathcal{Q}$ of all constantly
rebalanced portfolios. It is obvious that the strategies of this class induce the “constant”
forecasters studied in Sections 9.5, 9.6, and 9.7. Thus, combining Theorem 10.2 with
the remark following Theorem 9.2, we obtain the following expression for the behavior of
the minimax logarithmic wealth ratio:

$$
W_n(\mathcal{Q}) = \frac{m-1}{2} \ln \frac{n}{2\pi} + \ln \frac{\Gamma(1/2)^m}{\Gamma(m/2)} + o(1).
$$

This result shows that the wealth $S_n(\mathbf{P}^*, \mathbf{x}^n)$ of the minimax optimal investment strategy
given by Theorem 10.2 comes within a factor of order $n^{(m-1)/2}$ of the best possible constantly
rebalanced portfolio, regardless of the market behavior. Since, typically, $\sup_{\mathbf{Q} \in \mathcal{Q}} S_n(\mathbf{Q}, \mathbf{x}^n)$
increases exponentially with $n$, this factor becomes negligible in the long run. □
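For $m = 2$ the minimax value can be computed exactly and compared with the asymptotic expression. The sketch below assumes that the minimax regret of the constant experts equals the logarithm of the normalizing sum of the normalized maximum likelihood forecaster, $\ln \sum_{k=0}^{n} \binom{n}{k} (k/n)^k ((n-k)/n)^{n-k}$, and that the asymptotic form is $\frac{m-1}{2} \ln \frac{n}{2\pi} + \ln \frac{\Gamma(1/2)^m}{\Gamma(m/2)}$; the function names are illustrative.

```python
import math


def minimax_regret_binary(n: int) -> float:
    """Exact ln of the maximum-likelihood normalizer for binary constant experts."""
    log_terms = []
    for k in range(n + 1):
        log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        log_ml = 0.0  # ln of (k/n)^k ((n-k)/n)^(n-k), with 0^0 = 1
        if k > 0:
            log_ml += k * math.log(k / n)
        if k < n:
            log_ml += (n - k) * math.log((n - k) / n)
        log_terms.append(log_binom + log_ml)
    top = max(log_terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(v - top) for v in log_terms))


def asymptotic_regret(n: int, m: int = 2) -> float:
    """Leading terms of the minimax value for an m-letter alphabet (o(1) dropped)."""
    return (0.5 * (m - 1) * math.log(n / (2 * math.pi))
            + m * math.lgamma(0.5) - math.lgamma(m / 2))


for n in (10, 100, 1000):
    print(n, minimax_regret_binary(n), asymptotic_regret(n))
```

For $m = 2$ the asymptotic expression reduces to $\frac{1}{2} \ln n + \frac{1}{2} \ln \frac{\pi}{2}$, and the gap to the exact value shrinks as $n$ grows.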


10.4 Universal Portfolios 


Just as in the case of the prediction problem of Chapter 9, the minimax optimal solu- 
tion for the investment problem is not feasible in practice. In this section we introduce 
computationally more attractive methods, close in spirit to the mixture forecasters of Sec- 
tions 9.6 and 9.7. 

For simplicity and for its importance, in this section we restrict our attention to the class $\mathcal{Q}$ of
all constantly rebalanced portfolios. Recall that each strategy $\mathbf{Q}$ in this class is determined
by a vector $\mathbf{B} = (B_1, \ldots, B_m)$ in the probability simplex $\mathcal{D}$ in $\mathbb{R}^m$. In order to compete with
the best strategy in $\mathcal{Q}$, we introduce the universal portfolio strategy $\mathbf{P}$ by

$$
P_{j,t}(\mathbf{x}^{t-1}) = \frac{\int_{\mathcal{D}} B_j\, S_{t-1}(\mathbf{B}, \mathbf{x}^{t-1})\, \mu(\mathbf{B})\, d\mathbf{B}}{\int_{\mathcal{D}} S_{t-1}(\mathbf{B}, \mathbf{x}^{t-1})\, \mu(\mathbf{B})\, d\mathbf{B}}, \qquad j = 1, \ldots, m, \quad t = 1, \ldots, n,
$$

where $\mu$ is a density function on $\mathcal{D}$. In the simplest case $\mu$ is just the uniform density,
though we will see that it may be advantageous to consider nonuniform densities such
as the Dirichlet$(1/2, \ldots, 1/2)$ density. In any case, the universal portfolio is a weighted
average of the strategies in $\mathcal{Q}$, weighted by their past performance. We will see in the
proof of Theorem 10.3 below that the universal portfolio is nothing but the investment
strategy induced by the mixture forecaster (the Laplace mixture in case of uniform $\mu$ and
the Krichevsky–Trofimov mixture if $\mu$ is the Dirichlet$(1/2, \ldots, 1/2)$ density).

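The wealth achieved by the universal portfolio is just the average of the wealths achieved
by the individual strategies in the class. This may be easily seen by observing that

$$
\begin{aligned}
S_n(\mathbf{P}, \mathbf{x}^n) &= \prod_{t=1}^{n} \sum_{j=1}^{m} P_{j,t}(\mathbf{x}^{t-1})\, x_{j,t} \\
&= \prod_{t=1}^{n} \frac{\int_{\mathcal{D}} \sum_{j=1}^{m} x_{j,t} B_j\, S_{t-1}(\mathbf{B}, \mathbf{x}^{t-1})\, \mu(\mathbf{B})\, d\mathbf{B}}{\int_{\mathcal{D}} S_{t-1}(\mathbf{B}, \mathbf{x}^{t-1})\, \mu(\mathbf{B})\, d\mathbf{B}} \\
&= \prod_{t=1}^{n} \frac{\int_{\mathcal{D}} S_t(\mathbf{B}, \mathbf{x}^t)\, \mu(\mathbf{B})\, d\mathbf{B}}{\int_{\mathcal{D}} S_{t-1}(\mathbf{B}, \mathbf{x}^{t-1})\, \mu(\mathbf{B})\, d\mathbf{B}} \\
&= \int_{\mathcal{D}} S_n(\mathbf{B}, \mathbf{x}^n)\, \mu(\mathbf{B})\, d\mathbf{B}
\end{aligned}
$$

because the product is telescoping and $S_0 \equiv 1$. This last expression offers an intuitive
explanation of what the universal portfolio does: by approximating the integral by a Riemann
sum, we have

$$
S_n(\mathbf{P}, \mathbf{x}^n) \approx \sum_{i} Q_i\, S_n(\mathbf{B}_i, \mathbf{x}^n),
$$

where, given the elements $\Delta_i$ of a fine finite partition of the simplex $\mathcal{D}$, we assume that
$\mathbf{B}_i \in \Delta_i$ and $Q_i = \int_{\Delta_i} \mu(\mathbf{B})\, d\mathbf{B}$. The right-hand side is the capital accumulated by a strategy
that distributes its initial capital among the constantly rebalanced investment strategies $\mathbf{B}_i$
according to the proportions $Q_i$ and lets these strategies work with their initial share. In
other words, the universal portfolio performs a kind of buy-and-hold over all constantly
rebalanced portfolios.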
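The buy-and-hold interpretation can be illustrated for $m = 2$ by replacing the integral over the simplex with a finite grid of constantly rebalanced portfolios. This discretization is an assumption made here for computability, not the exact universal portfolio; the grid size and the function name `universal_portfolio_wealth` are illustrative.

```python
import math


def universal_portfolio_wealth(market, grid_size=201):
    """Grid approximation of the universal portfolio for two assets.

    Returns (wealth of the sequential strategy,
             weighted average of the grid portfolios' wealths,
             wealth of the best grid portfolio).
    """
    grid = [(i / (grid_size - 1), 1 - i / (grid_size - 1)) for i in range(grid_size)]
    weights = [1.0 / grid_size] * grid_size   # uniform weights on the grid points
    past = [1.0] * grid_size                  # wealth of each grid portfolio so far
    wealth = 1.0
    for x in market:
        norm = sum(w * s for w, s in zip(weights, past))
        # Performance-weighted average of the grid portfolios.
        p = [sum(w * s * b[j] for w, s, b in zip(weights, past, grid)) / norm
             for j in range(2)]
        wealth *= p[0] * x[0] + p[1] * x[1]
        past = [s * (b[0] * x[0] + b[1] * x[1]) for s, b in zip(past, grid)]
    average = sum(w * s for w, s in zip(weights, past))
    return wealth, average, max(past)


market = [(1.0, 0.5) if t % 2 == 1 else (1.0, 2.0) for t in range(1, 21)]
wealth, average, best = universal_portfolio_wealth(market)
print(wealth, average, math.log(best / wealth))
```

On the grid the telescoping identity holds exactly, so the sequential wealth coincides with the weighted average up to rounding, and the log-ratio to the best grid portfolio cannot exceed the logarithm of the grid size.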

This simple observation is the key in establishing performance bounds for the universal 
portfolio. 


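Theorem 10.3. If $\mu$ is the uniform density on the probability simplex $\mathcal{D}$ in $\mathbb{R}^m$, then the
wealth achieved by the universal portfolio satisfies

$$
\sup_{\mathbf{x}^n} \sup_{\mathbf{B} \in \mathcal{D}} \ln \frac{S_n(\mathbf{B}, \mathbf{x}^n)}{S_n(\mathbf{P}, \mathbf{x}^n)} \le (m - 1) \ln(n + 1).
$$

If the universal portfolio is defined using the Dirichlet$(1/2, \ldots, 1/2)$ density $\mu$, then

$$
\sup_{\mathbf{x}^n} \sup_{\mathbf{B} \in \mathcal{D}} \ln \frac{S_n(\mathbf{B}, \mathbf{x}^n)}{S_n(\mathbf{P}, \mathbf{x}^n)}
\le \frac{m-1}{2} \ln n + \ln \frac{\Gamma(1/2)^m}{\Gamma(m/2)} + \frac{m-1}{2} \ln 2 + o(1).
$$

The second statement shows that the logarithmic worst-case wealth ratio of the universal
portfolio based on the Dirichlet$(1/2, \ldots, 1/2)$ density comes within a constant of the
minimax optimal investment strategy.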


Proof. First recall that each constantly rebalanced portfolio strategy indexed by B is 
induced by the “constant” forecaster p®, which assigns probability 


Paty") = By! o Byrn 
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to each sequence y” in which the number of occurrences of symbol j isn; (j = 1,...,m). 
By Lemma 10.3, the wealth achieved by such a strategy is 


SB, x)=) (1 sw PrO”). 
y” t=1 


Using the fact that the wealth achieved by the universal portfolio is just the average of the 
wealths achieved by the strategies in Q, we have 


S,(P,x") = f S,(B, x”)u(B) dB 
D 
32 (LIe) f touw a. 
y” t=1 D 


The last expression shows that the universal portfolio P is induced by the mixture forecaster 


pay") = Í p(y") u(B) cB. 


(Simply note that by Lemma 10.3 the wealth achieved by the strategy induced by the 
mixture forecaster is the same as the wealth achieved by the universal portfolio; hence the 
two strategies must coincide.) 

By Theorem 10.1, 


Sn(B, x”) Pro") 
sup sup In max sup In 


se ae i 
x” BeD Sa (P, x") ~ yreyn BeD PrO”) 


In other words, the worst-case logarithmic wealth factor achieved by the universal portfolio 
is bounded by the worst-case logarithmic regret of the mixture forecaster. But we have 
already studied this latter quantity. In particular, if $\mu$ is the uniform density, then $p_\mu$ is just
the Laplace forecaster whose performance is bounded by Theorem 9.3 (for $m = 2$) and by
Exercise 9.10 (for $m > 2$), yielding the first half of the theorem.

The second statement is obtained by noting that if the universal portfolio is defined
on the basis of the Dirichlet$(1/2, \ldots, 1/2)$ density, then $p_\mu$ is the Krichevsky-Trofimov
forecaster whose loss is bounded in Section 9.7. ∎
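
As a numerical illustration of the first bound of Theorem 10.3, the following sketch approximates the universal portfolio for $m = 2$ assets and the uniform density by a trapezoidal average over a grid on the simplex. The grid size, the randomly drawn price relatives, and the function name are illustrative assumptions and do not come from the text.

```python
import math
import random

def universal_portfolio_log_ratio(x, grid=2001):
    """Return ln( max_B S_n(B, x^n) / S_n(P, x^n) ) for m = 2 assets.

    x is a list of (x_1t, x_2t) price relatives. S_n(P, x^n) is approximated
    by a trapezoidal average of S_n(B, x^n) over B = (b, 1 - b), b in [0, 1],
    which corresponds to the uniform density mu on the simplex. The maximum
    is taken over the same grid, so it is also an approximation.
    """
    bs = [k / (grid - 1) for k in range(grid)]
    # Wealth of each constantly rebalanced portfolio: prod_t B . x_t
    wealth = [math.prod(b * x1 + (1.0 - b) * x2 for x1, x2 in x) for b in bs]
    h = 1.0 / (grid - 1)
    universal = sum((wealth[k] + wealth[k + 1]) * h / 2 for k in range(grid - 1))
    return math.log(max(wealth) / universal)

random.seed(0)
n = 50
market = [(random.uniform(0.5, 1.5), random.uniform(0.5, 1.5)) for _ in range(n)]
ratio = universal_portfolio_log_ratio(market)
# Theorem 10.3 with m = 2 bounds the ratio by (m - 1) ln(n + 1) = ln(n + 1).
print(ratio, "<=", math.log(n + 1))
```

The cost of this direct approach grows quickly with $m$, since the grid must cover an $(m-1)$-dimensional simplex.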


10.5 The EG Investment Strategy 


The worst-case performance of the universal portfolio is basically unimprovable, but it has 
some practical disadvantages. Just note that the definition of the universal portfolio involves 
integration over an m-dimensional simplex. Even for moderate values of m, the exponential 
computational cost may become prohibitive. In this section we describe a simple strategy 
$P$ whose computational cost is linear in $m$, a dramatic improvement. Unfortunately, the
performance guarantees of this version are inferior to those established in Theorem 10.3
for the universal portfolio.

This strategy, called the EG investment strategy, invests at time $t$ using the vector $P_t =
(P_{1,t}, \ldots, P_{m,t})$, where $P_1 = (1/m, \ldots, 1/m)$ and

\[
P_{i,t} = \frac{P_{i,t-1} \exp\bigl(\eta\, x_{i,t-1} / (P_{t-1} \cdot x_{t-1})\bigr)}
{\sum_{j=1}^m P_{j,t-1} \exp\bigl(\eta\, x_{j,t-1} / (P_{t-1} \cdot x_{t-1})\bigr)},
\qquad i = 1, \ldots, m, \quad t = 2, 3, \ldots.
\]
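
The update above can be sketched in a few lines. This is a minimal illustration, assuming the market vectors are given as plain Python lists; the function names `eg_portfolios` and `log_wealth` and the toy market are hypothetical.

```python
import math

def eg_portfolios(x, eta):
    """Return the EG portfolio vectors P_1, ..., P_n for market vectors x.

    x is a list of length-m price-relative vectors; eta > 0 is the learning rate.
    Each round costs O(m) operations.
    """
    m = len(x[0])
    p = [1.0 / m] * m                                   # P_1 = (1/m, ..., 1/m)
    portfolios = []
    for xt in x:
        portfolios.append(p)
        ret = sum(pi * xi for pi, xi in zip(p, xt))     # P_t . x_t
        w = [pi * math.exp(eta * xi / ret) for pi, xi in zip(p, xt)]
        total = sum(w)
        p = [wi / total for wi in w]                    # P_{t+1}
    return portfolios

def log_wealth(portfolios, x):
    """ln S_n(P, x^n) = sum_t ln(P_t . x_t)."""
    return sum(math.log(sum(pi * xi for pi, xi in zip(p, xt)))
               for p, xt in zip(portfolios, x))

market = [[1.0, 2.0], [1.0, 0.5], [1.0, 2.0], [1.0, 0.5]]
portfolios = eg_portfolios(market, eta=0.5)
print(portfolios[1], log_wealth(portfolios, market))
```

Only the current portfolio and the latest market vector are needed at each step, which is the source of the linear cost in $m$.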


This weight assignment is a special case of the gradient-based forecaster for linear regression
introduced in Section 11.4:

\[
P_{i,t} = \frac{P_{i,t-1} \exp\bigl(-\eta\, \nabla \ell_{t-1}(P_{t-1})_i\bigr)}
{\sum_{j=1}^m P_{j,t-1} \exp\bigl(-\eta\, \nabla \ell_{t-1}(P_{t-1})_j\bigr)}
\]

when the loss function is set as $\ell_{t-1}(P_{t-1}) = -\ln(P_{t-1} \cdot x_{t-1})$, so that
$\nabla \ell_{t-1}(P_{t-1})_i = -x_{i,t-1}/(P_{t-1} \cdot x_{t-1})$.
Note that with this loss function, $\widehat{L}_n = -\ln S_n(P, x^n)$, where $\widehat{L}_n$ is the cumulative loss of
the gradient-based forecaster, and $L_n(B) = -\ln S_n(B, x^n)$, where $L_n(B)$ is the cumulative
loss of the forecaster with fixed coefficients $B$. Hence, by adapting the proof of Theorem 11.3,
which shows a bound on the regret of the gradient-based forecaster, we can
bound the worst-case logarithmic wealth ratio of the EG investment strategy. On the other
hand, the following simple and direct analysis provides slightly better constants.


Theorem 10.4. Assume that the price relatives $x_{i,t}$ all fall between two positive constants
$c < C$. Then the worst-case logarithmic wealth ratio of the EG investment strategy with
$\eta = (c/C)\sqrt{(8 \ln m)/n}$ is bounded by

\[
\frac{\ln m}{\eta} + \frac{n\eta}{8}\,\frac{C^2}{c^2} = \frac{C}{c}\sqrt{\frac{n}{2}\ln m}.
\]


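Proof. The worst-case logarithmic wealth ratio is

\[
\max_{x^n} \max_{B \in D} \ln \frac{\prod_{t=1}^n B \cdot x_t}{\prod_{t=1}^n P_t \cdot x_t},
\]

where the first maximum is taken over market sequences satisfying the boundedness
assumption. By using the elementary inequality $\ln(1 + u) \le u$, we obtain

\[
\begin{aligned}
\ln \frac{\prod_{t=1}^n B \cdot x_t}{\prod_{t=1}^n P_t \cdot x_t}
&= \sum_{t=1}^n \ln\Bigl(1 + \frac{(B - P_t) \cdot x_t}{P_t \cdot x_t}\Bigr) \\
&\le \sum_{t=1}^n \sum_{i=1}^m \frac{(B_i - P_{i,t})\, x_{i,t}}{P_t \cdot x_t} \\
&= \sum_{t=1}^n \Bigl(\sum_{j=1}^m \frac{B_j\, x_{j,t}}{P_t \cdot x_t} - \sum_{i=1}^m \frac{P_{i,t}\, x_{i,t}}{P_t \cdot x_t}\Bigr).
\end{aligned}
\]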
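Introducing the notation

\[
\ell'_{i,t} = \frac{C}{c} - \frac{x_{i,t}}{P_t \cdot x_t}
\]

and noting that, under the boundedness assumption $0 < c \le x_{i,t} \le C$, $\ell'_{i,t} \in [0, C/c]$, we
may rewrite the wealth ratio above as

\[
\ln \frac{\prod_{t=1}^n B \cdot x_t}{\prod_{t=1}^n P_t \cdot x_t}
\le \sum_{t=1}^n \sum_{i=1}^m P_{i,t}\, \ell'_{i,t} - \sum_{t=1}^n \sum_{j=1}^m B_j\, \ell'_{j,t}.
\]

Because this expression is a linear function of $B$, it achieves its maximum in one of the
corners of the simplex $D$, and therefore

\[
\max_{B \in D} \ln \frac{\prod_{t=1}^n B \cdot x_t}{\prod_{t=1}^n P_t \cdot x_t}
\le \sum_{t=1}^n \sum_{i=1}^m P_{i,t}\, \ell'_{i,t} - \min_{j=1,\ldots,m} \sum_{t=1}^n \ell'_{j,t}.
\]

Now note that

\[
P_{i,t} = \frac{\exp\bigl(\eta \sum_{s=1}^{t-1} x_{i,s}/(P_s \cdot x_s)\bigr)}
{\sum_{j=1}^m \exp\bigl(\eta \sum_{s=1}^{t-1} x_{j,s}/(P_s \cdot x_s)\bigr)}
= \frac{\exp\bigl(-\eta \sum_{s=1}^{t-1} \ell'_{i,s}\bigr)}
{\sum_{j=1}^m \exp\bigl(-\eta \sum_{s=1}^{t-1} \ell'_{j,s}\bigr)}.
\]

Hence, $P_1, P_2, \ldots$ are the predictions of the exponentially weighted average forecaster
applied to a linear loss function with range $[0, C/c]$. The regret

\[
\sum_{t=1}^n \sum_{i=1}^m P_{i,t}\, \ell'_{i,t} - \min_{j=1,\ldots,m} \sum_{t=1}^n \ell'_{j,t}
\]

can thus be bounded using Theorem 2.2 applied to scaled losses (see Section 2.6). (Note
that this theorem also applies when, as in this case, the losses at time $t$ depend on the
forecaster's prediction $P_t \cdot x_t$.) ∎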
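
The tuning of $\eta$ in Theorem 10.4 can be checked by a short computation, added here as a worked verification. Treating the bound as a function of $\eta$ and setting its derivative to zero recovers the stated learning rate, at which the two terms coincide:

\[
\frac{d}{d\eta}\Bigl(\frac{\ln m}{\eta} + \frac{n\eta}{8}\,\frac{C^2}{c^2}\Bigr)
= -\frac{\ln m}{\eta^2} + \frac{n}{8}\,\frac{C^2}{c^2} = 0
\iff \eta = \frac{c}{C}\sqrt{\frac{8 \ln m}{n}},
\]

\[
\frac{\ln m}{\eta} = \frac{n\eta}{8}\,\frac{C^2}{c^2} = \frac{C}{c}\sqrt{\frac{n \ln m}{8}},
\qquad \text{so that} \qquad
\frac{\ln m}{\eta} + \frac{n\eta}{8}\,\frac{C^2}{c^2} = 2\,\frac{C}{c}\sqrt{\frac{n \ln m}{8}} = \frac{C}{c}\sqrt{\frac{n}{2}\ln m}.
\]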


The knowledge of constants c and C can be avoided using exponential forecasters with 
time-varying potentials like the one described in Section 2.8. 


Remark 10.1. A linear upper bound on the worst-case logarithmic wealth ratio is inevitably
suboptimal. Indeed, the linear upper bound

\[
\sum_{j=1}^m B_j \sum_{t=1}^n \Bigl(\sum_{i=1}^m P_{i,t}\, \ell'_{i,t} - \ell'_{j,t}\Bigr)
= \sum_{j=1}^m \sum_{i=1}^m B_j \Bigl(\sum_{t=1}^n P_{i,t}\,(\ell'_{i,t} - \ell'_{j,t})\Bigr)
\]

is maximized for a constantly rebalanced portfolio $B$ lying in a corner of the simplex $D$,
whereas the logarithmic wealth ratio $\ln \prod_{t=1}^n (B \cdot x_t / P_t \cdot x_t)$ is concave in $B$, and therefore it
is possibly maximized in the interior of the simplex. Thus, no algorithm trying to minimize
the linear upper bound on the worst-case logarithmic wealth ratio can be minimax optimal.
Note also that the bound obtained for the worst-case logarithmic wealth ratio of the EG
strategy grows as $\sqrt{n}$, whereas that of the universal portfolio has only a logarithmic growth.


The following simple example shows that the bound of the order of $\sqrt{n}$ cannot be improved
for the EG strategy. Consider a market with two assets and market vectors $x_t = (1, 1/2)$ for
all $t$. Then, for every wealth allocation $P_t$, $1/2 \le P_t \cdot x_t \le 1$. The best constantly rebalanced
portfolio is clearly $(1, 0)$, and the worst-case logarithmic wealth ratio is

\[
\sum_{t=1}^n \ln \frac{1}{1 - P_{2,t}/2} \ge \sum_{t=1}^n \frac{P_{2,t}}{2}.
\]

In the case of the EG strategy, we may lower bound $P_{2,t}$ by

\[
\begin{aligned}
P_{2,t} &= \frac{\exp\bigl(\eta \sum_{s=1}^{t-1} \frac{1}{2 P_s \cdot x_s}\bigr)}
{\exp\bigl(\eta \sum_{s=1}^{t-1} \frac{1}{P_s \cdot x_s}\bigr) + \exp\bigl(\eta \sum_{s=1}^{t-1} \frac{1}{2 P_s \cdot x_s}\bigr)} \\
&= \frac{\exp\bigl(-\eta \sum_{s=1}^{t-1} \frac{1}{2 P_s \cdot x_s}\bigr)}
{1 + \exp\bigl(-\eta \sum_{s=1}^{t-1} \frac{1}{2 P_s \cdot x_s}\bigr)} \\
&\ge \frac{\exp(-\eta(t-1))}{2}.
\end{aligned}
\]

Thus, the logarithmic wealth ratio of the EG algorithm is bounded from below by

\[
\sum_{t=1}^n \frac{\exp(-\eta(t-1))}{4} = \frac{1}{4}\,\frac{1 - e^{-\eta n}}{1 - e^{-\eta}} \approx \frac{1}{4\eta},
\]

where the last approximation holds for large values of $n$. Since $\eta$ is proportional to $1/\sqrt{n}$,
the worst-case logarithmic wealth ratio is proportional to $\sqrt{n}$, a value significantly larger
than the logarithmic growth obtained for the universal portfolio.
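
The two-asset example can be simulated directly. The sketch below runs the EG update on the market $x_t = (1, 1/2)$ with the learning rate of Theorem 10.4 (here $c = 1/2$, $C = 1$, $m = 2$) and compares the realized logarithmic wealth ratio with the lower bound $\sum_t \exp(-\eta(t-1))/4$. The function name and the values of $n$ are illustrative choices.

```python
import math

def eg_two_asset_example(n):
    """EG on the market x_t = (1, 1/2); returns (log wealth ratio, lower bound, eta)."""
    c, C, m = 0.5, 1.0, 2
    eta = (c / C) * math.sqrt(8.0 * math.log(m) / n)    # tuning of Theorem 10.4
    p1, p2 = 0.5, 0.5                                   # P_1 is uniform
    ratio, lower = 0.0, 0.0
    for t in range(1, n + 1):
        ret = p1 * 1.0 + p2 * 0.5                       # P_t . x_t = 1 - P_{2,t}/2
        ratio -= math.log(ret)                          # portfolio (1, 0) has B . x_t = 1
        lower += math.exp(-eta * (t - 1)) / 4.0         # lower bound from the text
        w1 = p1 * math.exp(eta * 1.0 / ret)
        w2 = p2 * math.exp(eta * 0.5 / ret)
        p1, p2 = w1 / (w1 + w2), w2 / (w1 + w2)
    return ratio, lower, eta

for n in (100, 400, 1600):
    ratio, lower, eta = eg_two_asset_example(n)
    print(n, ratio, lower, 1.0 / (4.0 * eta))
```

Quadrupling $n$ halves $\eta$, so the last printed column, $1/(4\eta)$, doubles with each step.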


10.6 Investment with Side Information 


The investment strategies considered up to this point determine their portfolio as a function 
of the past market behavior and the investment strategies in the comparison class. However, 
sometimes an investor may want to incorporate external information in constructing a 
portfolio. For example, the price of oil may have an effect on stock prices, and one may
not want to ignore it even if oil is not traded on the market. Such arguments lead us to
incorporate the notion of side information. We do this as in Section 9.9.

Suppose that, at trading period $t$, before determining a portfolio, the investor observes
the side information $z_t$, which we assume to take values in a set $\mathcal{Z}$ of finite cardinality.
For simplicity, and without loss of generality, we take $\mathcal{Z} = \{1, \ldots, K\}$. The portfolio
chosen by the forecaster at time $t$ may now depend on the side information $z_t$. Formally, an
investment strategy with side information $Q$ is a sequence of functions $Q_t : \mathbb{R}_+^{(t-1)m} \times \mathcal{Z} \to D$,
$t = 1, \ldots, n$. At time $t$, on observing the side information $z_t$, the strategy uses the portfolio
$Q_t(x^{t-1}, z_t)$. Starting with a unit capital, the accumulated wealth after $n$ trading periods
becomes

\[
S_n(Q, x^n, z^n) = \prod_{t=1}^n x_t \cdot Q_t(x^{t-1}, z_t).
\]

Our goal is to design investment strategies that compete with the best in a given reference
class of investment strategies with side information. For simplicity, we consider reference
classes built from static classes of investment strategies. More precisely, let $\mathcal{Q}_1, \ldots, \mathcal{Q}_K$
be “base” classes of static investment strategies and let $Q^{(1)} \in \mathcal{Q}_1, \ldots, Q^{(K)} \in \mathcal{Q}_K$ be
arbitrary strategies. The class of investment strategies with side information we consider
consists of those strategies for which

\[
Q_t(x^{t-1}, z_t) = Q^{(z_t)}_{n_{z_t}},
\]

where $Q^{(j)}_t \in D$ is the portfolio of an investment strategy $Q^{(j)} \in \mathcal{Q}_j$ at the $t$th time period and
$n_j$ is the length of the sequence of those time instances $s < t$ when $z_s = j$ ($j = 1, \ldots, K$).
In other words, a strategy in the comparison class assigns a static strategy $Q^{(j)}$ to any $j =
1, \ldots, K$ and uses this strategy whenever the side information equals $j$. This formulation
is the investment analogue of the problem of prediction with side information described in
Section 9.9.

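In analogy with the forecasting strategies introduced in Section 9.9, we may consider
the following investment strategy: let $\mathcal{G}_1, \ldots, \mathcal{G}_K$ denote the classes of (static) forecasters
induced by the classes of investment strategies $\mathcal{Q}_1, \ldots, \mathcal{Q}_K$, respectively. Let $q^{(1)}, \ldots, q^{(K)}$
be forecasters with worst-case cumulative regrets (with respect to the corresponding
reference classes)

\[
V_n(q^{(j)}, \mathcal{G}_j) = \sup_{y^n \in \mathcal{Y}^n} \sup_{g^{(j)} \in \mathcal{G}_j} \sum_{t=1}^n \ln \frac{g^{(j)}_t(y_t)}{q^{(j)}_t(y_t \mid y^{t-1})}.
\]

On the basis of this, one may define the forecaster with side information as in Section 9.9:

\[
p_t(y \mid y^{t-1}, z_t) = q^{(z_t)}_{n_{z_t}}\bigl(y \mid y^{t-1}_{z_t}\bigr),
\]

where $y^{t-1}_j$ denotes the subsequence of $y^{t-1}$ determined by those time instances $s < t$ when $z_s = j$.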


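This forecaster now induces the investment strategy with side information $P_t(x^{t-1}, z_t)$
defined by its components

\[
P_{j,t}(x^{t-1}, z_t) = \frac{\sum_{y^{t-1}} p_t(j \mid y^{t-1}, z_t) \prod_{s=1}^{t-1} \bigl(x_{y_s,s}\, p_s(y_s \mid y^{s-1}, z_s)\bigr)}
{\sum_{y^{t-1}} \prod_{s=1}^{t-1} \bigl(x_{y_s,s}\, p_s(y_s \mid y^{s-1}, z_s)\bigr)}.
\]

The following result is a straightforward combination of Theorems 9.7 and 10.1. The proof
is left as an exercise.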


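Theorem 10.5. For any side-information sequence $z_1, \ldots, z_n$, the investment strategy
defined above has a worst-case logarithmic wealth ratio bounded by

\[
\sup_{x^n} \sup_{Q \in \mathcal{Q}} \ln \frac{S_n(Q, x^n, z^n)}{S_n(P, x^n, z^n)} \le \sum_{j=1}^K V_{n_j}(q^{(j)}, \mathcal{G}_j),
\]

where $n_j$ is the number of time instances $t \le n$ with $z_t = j$.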


If the base classes $\mathcal{Q}_1, \ldots, \mathcal{Q}_K$ are all equal to the class of all constantly rebalanced portfolios
and the forecasters $q^{(j)}$ are mixture forecasters, then it is easy to see that the investment strategy of
Theorem 10.5 takes the simple form

\[
P_{j,t}(x^{t-1}, z_t) = \frac{\int_D B_j\, S_{n_{z_t}}\bigl(B, x^{t-1}_{z_t}\bigr)\, \mu(B)\, dB}
{\int_D S_{n_{z_t}}\bigl(B, x^{t-1}_{z_t}\bigr)\, \mu(B)\, dB},
\qquad j = 1, \ldots, m, \quad t = 1, \ldots, n,
\]

where $\mu$ is a density function on $D$ and $x^{t-1}_j$ is the subsequence of the past market sequence
$x^{t-1}$ determined by those time instances $s < t$ when $z_s = j$. Thus, the strategy defined
above simply selects the subsequence of the past corresponding to the times when the
side information was the same as the actual value of side information $z_t$ and calculates a
universal portfolio over that subsequence. Theorem 10.5, combined with Theorem 10.3,
implies the following.

Corollary 10.1. Assume that the universal portfolio with side information defined above is
calculated on the basis of the Dirichlet$(1/2, \ldots, 1/2)$ density $\mu$. Let $\mathcal{Q}$ denote the class
of investment strategies with side information such that the base classes $\mathcal{Q}_1, \ldots, \mathcal{Q}_K$ all
coincide with the class of constantly rebalanced portfolios. Then, for any side information
sequence, the worst-case logarithmic wealth ratio with respect to class $\mathcal{Q}$ satisfies

\[
\sup_{x^n} \sup_{Q \in \mathcal{Q}} \ln \frac{S_n(Q, x^n, z^n)}{S_n(P, x^n, z^n)}
\le \frac{K(m-1)}{2} \ln \frac{n}{K} + K \ln \frac{\Gamma(1/2)^m}{\Gamma(m/2)} + \frac{K(m-1)}{2} \ln 2 + o(1).
\]


10.7 Bibliographic Remarks


The theory of portfolio selection was initiated by the influential work of Markowitz [209],
who introduced a statistical theory of investment. Kelly [180] considered a quite different 
approach, closer to the spirit of this chapter, for horse race markets of the type described 
in Section 9.3, and assumed an independent, identically distributed sequence of market 
vectors. Breiman [41] extended Kelly’s framework to general markets with i.i.d. returns. 
The assumption of independence was substantially relaxed by Algoet and Cover [6], who 
considered stationary and ergodic markets; see also Algoet [4,5], Walk and Yakowitz [304], 
Györfi and Schäfer [136], and Györfi, Lugosi, and Udina [135] for various results under
such general assumptions. 

The problem of sequential investment in arbitrary markets was first considered by Cover
and Gluss [71], who used Blackwell’s approachability (see Section 7.7) to construct an 
investment strategy that performs almost as well as the best constantly rebalanced portfolio 
if the market vectors take their values from a given finite set. The universal portfolio 
strategy, discussed in Section 10.4, was introduced and analyzed in a pioneering work of 
Cover [70]. The minimax value W,,(Q) for the class of all constantly rebalanced portfolios 
was found by Ordentlich and Cover [229]. The bound for the universal portfolios over 
the class of constantly rebalanced portfolios was obtained by Cover and Ordentlich [72]. 
The general results of Theorems 10.1 and 9.2 were given in Cesa-Bianchi and Lugosi [52]. 
The EG investment strategy was introduced and analyzed by Helmbold, Schapire, Singer,
and Warmuth [158]. Theorem 10.4 is due to them (though the proof presented here is taken 
from Stoltz and Lugosi [279]). 

The model of investment with side information described in Section 10.6 was introduced 
by Cover and Ordentlich [72], and Corollary 10.1 is theirs. Györfi, Lugosi, and Udina [135]
choose the side information by nonparametric methods and construct investment strategies
with universal guarantees for stationary and ergodic markets. 

Singer [269] considers the problem of “tracking the best portfolio” and uses the
techniques described in Section 5.2 to construct investment strategies that perform almost as
well as the best investment strategy, in hindsight, which is allowed to switch between
portfolios a limited number of times.

Kalai and Vempala [175] develop efficient algorithms for approximate calculation of the
universal portfolio.

Vovk and Watkins [303] describe various versions of the universal portfolio, considering, 
among others, the possibility of “short sales,” that is, when the portfolio vectors may have 
negative components (see also Cover and Ordentlich [73]). 

Cross and Barron [78] extend the universal portfolio to general smoothly parameterized 
classes of investment strategies and also extend the problem of sequential investment to 
continuous time. 

One important aspect of sequential trading that we have ignored throughout the chapter 
is transaction costs. Including transaction costs in the model is a complex problem that has 
been considered from various different points of view. We just mention the work of Blum 
and Kalai [33], Iyengar and Cover [167], Iyengar [166], and Merhav, Ordentlich, Seroussi, 
and Weinberger [215]. 

Borodin, El-Yaniv, and Gogan [37] propose ad hoc investment strategies with very 
convincing empirical performance. 

Stoltz and Lugosi [279] introduce and study the notion of internal regret (see Section 4.4) 
in the framework of sequential investment. 


10.8 Exercises 


10.1 Show that for any constantly rebalanced portfolio strategy $B$, the achieved wealth $S_n(B, x^n)$ is
invariant under permutations of the sequence $x_1, \ldots, x_n$.


10.2 Show that the wealth $S_n(P, x^n)$ achieved by the universal portfolio $P$ is invariant under
permutations of the sequence $x_1, \ldots, x_n$. Show that the same is true for the minimax optimal
investment strategy $P^*$ (with respect to the class of constantly rebalanced portfolios).

10.3 (Universal portfolio exceeds value line index) Let $P$ be the universal portfolio strategy
based on the uniform density $\mu$. Show that the wealth achieved by $P$ is at least as large as the
geometric mean of the wealth achieved by the individual stocks, that is,

\[
S_n(P, x^n) \ge \Bigl(\prod_{j=1}^m \prod_{t=1}^n x_{j,t}\Bigr)^{1/m}
\]

(Cover [70]). Hint: Use Jensen's inequality twice.

10.4 Let F be a finite class of forecasters. Show that the investment strategy induced by the mixture 
forecaster over this class is just the strategy described in the first example of Section 10.2 
based on the class Q of investment strategies induced by members of F. 

10.5 Prove Lemma 10.2. 


10.6 Consider the following randomized approximation of the universal portfolio. Observe that
an interpretation of the identity $S_n(P, x^n) = \int_D S_n(B, x^n)\,\mu(B)\,dB$ is that the wealth achieved
by the universal portfolio is the expected value, with respect to the density $\mu$, of the wealth
of all constantly rebalanced strategies. This expectation may be approximated by randomly
choosing $N$ vectors $B_1, \ldots, B_N$ according to the density $\mu$ and distributing the initial wealth
uniformly among them just as in the example of Section 10.2. Investigate the relationship of
the wealth achieved by this randomized strategy and that of the universal portfolio. What value
of $N$ do you suggest? (Blum and Kalai [33].)


10.7 Consider a market of $m > 2$ assets and class $\mathcal{Q}$ of all investment strategies that rebalance
between two assets. More precisely, $\mathcal{Q}$ is the class of all constantly rebalanced portfolios
such that the probability vector $B \in D$ characterizing the strategy has at most two nonzero
components. Determine tight bounds for the minimax logarithmic wealth ratio $W_n(\mathcal{Q})$. Define
and analyze a universal portfolio for this class.


10.8 (Universal portfolio for smoothly parameterized classes) Let $\mathcal{Q}$ be a class of static investment
strategies, that is, each $Q \in \mathcal{Q}$ is given by a sequence $B_1, \ldots, B_n$ of portfolio vectors (i.e.,
$B_t \in D$). Assume that the strategies in $\mathcal{Q}$ are parameterized by a set of vectors $\Theta \subset \mathbb{R}^d$, that is,
$\mathcal{Q} = \{Q_\theta = (B_{\theta,1}, \ldots, B_{\theta,n}) : \theta \in \Theta\}$. Assume that $\Theta$ is a convex, compact set with nonempty
interior, and the parameterization is smooth in the sense that

\[
\|B_{\theta,t} - B_{\theta',t}\| \le c\,\|\theta - \theta'\|
\]

for all $\theta, \theta' \in \Theta$ and $t = 1, \ldots, n$, where $c > 0$ is a constant. Let $\mu$ be a bounded density on
$\Theta$ and define the generalized universal portfolio by

\[
P_{j,t}(x^{t-1}) = \frac{\int_\Theta B^{(j)}_{\theta,t}\, S_{t-1}(Q_\theta, x^{t-1})\, \mu(\theta)\, d\theta}
{\int_\Theta S_{t-1}(Q_\theta, x^{t-1})\, \mu(\theta)\, d\theta},
\qquad j = 1, \ldots, m, \quad t = 1, \ldots, n,
\]

where $B^{(j)}_{\theta,t}$ denotes the $j$th component of the portfolio vector $B_{\theta,t}$. Show that if the price
relatives $x_{i,t}$ fall between $1/C$ and $C$ for some constant $C > 1$, then the worst-case logarithmic
wealth ratio satisfies

\[
\sup_{x^n} \sup_{\theta \in \Theta} \ln \frac{S_n(Q_\theta, x^n)}{S_n(P, x^n)} = O(d \ln n)
\]

(Cross and Barron [78]).


10.9 (Switching portfolios) Define class $\mathcal{Q}$ of investment strategies that can switch buy-and-hold
strategies at most $k$ times during $n$ trading periods. More precisely, any strategy $Q \in \mathcal{Q}$ is
characterized by a sequence $i_1, \ldots, i_n \in \{1, \ldots, m\}$ of indices of assets with $\mathrm{size}(i_1, \ldots, i_n) \le
k$ (where size denotes the number of switches in the sequence; see the definition in Section 5.2)
such that, at time $t$, $Q$ invests all its capital in asset $i_t$. Construct an efficiently computable
investment strategy $P$ whose worst-case logarithmic wealth ratio satisfies

\[
\sup_{x^n} \sup_{Q \in \mathcal{Q}} \ln \frac{S_n(Q, x^n)}{S_n(P, x^n)} \le (k+1) \ln m + k \ln \frac{n}{k}
\]

(Singer [269]). Hint: Combine Theorem 10.1 with the techniques in Section 5.2.

10.10 (Switching constantly rebalanced portfolios) Consider now class $\mathcal{Q}$ of investment strategies
that can switch constantly rebalanced portfolios at most $k$ times. Thus, a strategy $Q \in \mathcal{Q}$
is defined by a sequence of portfolio vectors $B_1, \ldots, B_n$ such that the number of times $t =
1, \ldots, n-1$ with $B_t \ne B_{t+1}$ is bounded by $k$. Construct an efficiently computable investment
strategy $P$ whose logarithmic wealth ratio satisfies

\[
\ln \frac{S_n(Q, x^n)}{S_n(P, x^n)} \le \frac{C}{c} \sqrt{\frac{n}{2}\Bigl((k+1) \ln m + (n-1)\, H\Bigl(\frac{k}{n-1}\Bigr)\Bigr)}
\]

whenever the price relatives $x_{i,t}$ fall between the constants $c < C$. Hint: Combine the EG
investment strategy with the algorithm for tracking the best expert of Section 5.2.


10.11 (Internal regret for sequential investment) Given any investment strategy $Q$, one may define
its internal regret (in analogy to the internal regret of forecasting strategies; see Section 4.4),
for any $i, j \in \{1, \ldots, m\}$, by

\[
R_{(i,j),n} = \sum_{t=1}^n \ln \frac{Q_t^{i \to j} \cdot x_t}{Q_t \cdot x_t},
\]

where the modified portfolio $Q_t^{i \to j}$ is defined such that its $i$th component equals 0, its $j$th
component equals $Q_{j,t} + Q_{i,t}$, and all other components are equal to those of $Q_t$. Construct
an investment strategy such that if the price relatives are bounded between two constants $c$
and $C$, then $\max_{i,j} R_{(i,j),n} = O(n^{1/2})$ and, at the same time, the logarithmic wealth ratio with
respect to the class of all constantly rebalanced portfolios is also bounded by $O(n^{1/2})$ (Stoltz
and Lugosi [279]). Hint: First establish a linear upper bound as in Section 10.4 and then use
an internal regret minimizing forecaster from Section 4.4.


10.12 Prove Theorem 10.5 and Corollary 10.1. 


11


Linear Pattern Recognition 


11.1 Prediction with Side Information 


We extend the protocol of prediction with expert advice by assuming that some side
information, represented by a real vector $x_t \in \mathbb{R}^d$, is observed at the beginning of each
prediction round $t$. In this extended protocol we study experts and forecasters whose
predictions are based on linear functions of the side information.

Let the decision space $\mathcal{D}$ and the outcome space $\mathcal{Y}$ be a common subset of the real
line $\mathbb{R}$. Linear experts are experts indexed by vectors $u \in \mathbb{R}^d$. In the sequel we identify
experts with this corresponding parameter, thus referring to a vector $u$ as a linear expert.
The prediction $f_{u,t}$ of a linear expert $u$ at time $t$ is a linear function of the side information:
$f_{u,t} = u \cdot x_t$. Likewise, the prediction $\hat{p}_t$ of the linear forecaster at time $t$ is $\hat{p}_t = w_{t-1} \cdot x_t$,
where the weight vector $w_{t-1}$ is typically updated making use of the side information $x_t$.

This prediction protocol can be naturally related to a sequential model of pattern recogni- 
tion: by viewing the components of the side-information vector as features of an underlying 
data element, we can use linear forecasters to solve pattern classification or regression 
problems, as described in Section 11.3 and subsequent sections. 

As usual, we define the regret of a forecaster with respect to expert $u \in \mathbb{R}^d$ by

\[
R_n(u) = \widehat{L}_n - L_n(u) = \sum_{t=1}^n \bigl(\ell(\hat{p}_t, y_t) - \ell(u \cdot x_t, y_t)\bigr),
\]

where $\ell$ is a fixed loss function.
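
The protocol and the regret can be sketched as a short loop. The update rule below (a plain gradient step on the square loss with rate 0.1), the zero initial weights, and the toy data are placeholder assumptions chosen only to make the sketch runnable; the forecasters studied in this chapter are defined in later sections.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def regret_against(u, xs, ys, update, loss):
    """R_n(u) for a linear forecaster run on (x_1, y_1), ..., (x_n, y_n)."""
    w = [0.0] * len(xs[0])              # initial weights w_0 (an arbitrary choice)
    regret = 0.0
    for x, y in zip(xs, ys):
        p_hat = dot(w, x)               # prediction uses w_{t-1} and x_t only
        regret += loss(p_hat, y) - loss(dot(u, x), y)
        w = update(w, x, y)             # y_t is revealed, then the weights are updated
    return regret

square_loss = lambda p, y: (p - y) ** 2

def gradient_step(w, x, y, rate=0.1):
    """Placeholder update: one gradient step on the square loss."""
    g = 2.0 * (dot(w, x) - y)
    return [wi - rate * g * xi for wi, xi in zip(w, x)]

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [1.0, -1.0, 0.0]
u = [1.0, -1.0]                          # this expert predicts every outcome exactly
r = regret_against(u, xs, ys, gradient_step, square_loss)
print(r)
```

Because the expert $u$ incurs zero loss on this data, the regret equals the forecaster's cumulative loss.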

In some applications we slightly depart from the linear prediction model by considering
forecasters and experts that, given the side information $x_t$, predict with $\sigma(u \cdot x_t)$, where
$\sigma : \mathbb{R} \to \mathbb{R}$ is a nonlinear transfer function. This transfer function, if chosen in conjunction
with a specific loss function, makes the proof of regret bounds easier.

In this chapter we derive bounds on the regret of forecasters using weights of the form
$w = \nabla\Phi$, where $\Phi$ is a potential function. Unlike in the case of the weighted average
forecasters using the advice of finitely many experts, we do not define potentials over the
regret space (which is now a space of functions indexed by $\mathbb{R}^d$). Rather, we generalize the
approach of defining potentials over the gradient of the loss presented in Section 2.5 for
the exponential potential. To carry out this generalization we need some tools from convex
analysis that are described in the next section. In Section 11.3 we introduce the
gradient-based linear forecaster and prove a general bound for its regret. This bound is specialized to
the polynomial and exponential potentials in Section 11.4, where we analyze the
gradient-based forecaster using a nonlinear transfer function to control the norm of the loss gradient.
Section 11.5 introduces the projected forecaster, a gradient-based forecaster whose weights
are kept in a given convex and closed region by means of repeated projections. This 
forecaster, when used with a polynomial potential, enjoys some remarkable properties. In 
particular, we show that projected forecasters are able to “track” the best linear expert and 
to dynamically tune their learning rate in a nearly optimal way. 

In the rest of the chapter we explore potentials that change over time. The regret bounds 
that we obtain for the square loss grow logarithmically with time, providing an exponential 
improvement on the bounds obtained using static potentials. In Section 11.9 we show that 
these bounds cannot be improved any further. Finally, in Section 11.10 we obtain similar 
improved regret bounds for the logarithmic loss. However, the forecaster that achieves such 
logarithmic regret bounds is different, because it is based on a mixture of experts, similar, 
in spirit, to the mixture forecasters studied in Chapter 9. 


11.2 Bregman Divergences 


In this section we make a digression to introduce Bregman divergences, a notion that plays 
a key role in the analysis of linear forecasters. 

Bregman divergences are a natural way of defining a notion of “distance” on the basis 
of an arbitrary convex function. To ensure that these divergences enjoy certain useful 
properties, the convex functions must obey some restrictions. 

We call Legendre any function $F : A \to \mathbb{R}$ such that

1. $A \subseteq \mathbb{R}^d$ is nonempty and its interior $\mathrm{int}(A)$ is convex;

2. $F$ is strictly convex with continuous first partial derivatives throughout $\mathrm{int}(A)$;

3. if $x_1, x_2, \ldots \in A$ is a sequence converging to a boundary point of $A$, then
$\|\nabla F(x_n)\| \to \infty$ as $n \to \infty$.


The Bregman divergence induced by a Legendre function $F : A \to \mathbb{R}$ is the nonnegative
function $D_F : A \times \mathrm{int}(A) \to \mathbb{R}$ defined by

\[
D_F(u, v) = F(u) - F(v) - (u - v) \cdot \nabla F(v).
\]

Hence, the Bregman divergence from $u$ to $v$ is simply the difference between $F(u)$ and
its linear approximation via the first-order Taylor expansion of $F$ around $v$. Due to the
convexity of $F$, this difference is always nonnegative. Clearly, if $u = v$, then $D_F(u, v) = 0$.
Note also that the divergence is not symmetric in the arguments $u$ and $v$, so we will speak
of $D_F(u, v)$ as the divergence from $u$ to $v$.


Example 11.1. The half of the squared euclidean distance $\frac{1}{2}\|u - v\|^2$ is the (symmetric)
Bregman divergence induced by the half of the squared euclidean norm $F(x) = \frac{1}{2}\|x\|^2$. In
this example we may take $A = \mathbb{R}^d$.


Figure 11.1. This figure illustrates the generalized pythagorean inequality for squared euclidean
distances. The law of cosines states that $\|u - w\|^2$ is equal to $\|u - w'\|^2 + \|w' - w\|^2 -
2\,\|u - w'\|\,\|w' - w\| \cos\theta$, where $\theta$ is the angle at vertex $w'$ of the triangle. If $w'$ is the closest
point to $w$ in the convex set $S$, then $\cos\theta \le 0$, implying that $\|u - w\|^2 \ge \|u - w'\|^2 + \|w' - w\|^2$.


Example 11.2. The unnormalized Kullback-Leibler divergence

\[
D_F(p, q) = \sum_{i=1}^d p_i \ln\frac{p_i}{q_i} + \sum_{i=1}^d (q_i - p_i)
\]

is the Bregman divergence induced by the unnormalized negative entropy

\[
F(p) = \sum_{i=1}^d p_i \ln p_i - \sum_{i=1}^d p_i
\]

defined on $A = (0, \infty)^d$.
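
The definition and the two examples can be checked numerically. The following is a minimal sketch in which the helper `bregman` and the test points are illustrative choices; the gradient of the unnormalized negative entropy is $\nabla F(p)_i = \ln p_i$.

```python
import math

def bregman(F, grad_F, u, v):
    """D_F(u, v) = F(u) - F(v) - (u - v) . grad F(v)."""
    g = grad_F(v)
    return F(u) - F(v) - sum((ui - vi) * gi for ui, vi, gi in zip(u, v, g))

# Example 11.1: half of the squared euclidean norm.
half_sq = lambda x: 0.5 * sum(xi * xi for xi in x)
grad_half_sq = lambda x: list(x)

# Example 11.2: unnormalized negative entropy on (0, inf)^d.
neg_entropy = lambda p: sum(pi * math.log(pi) - pi for pi in p)
grad_neg_entropy = lambda p: [math.log(pi) for pi in p]

u, v = [0.2, 0.5, 1.3], [0.7, 0.1, 0.9]
d_euclid = bregman(half_sq, grad_half_sq, u, v)
d_kl = bregman(neg_entropy, grad_neg_entropy, u, v)
kl_closed_form = sum(a * math.log(a / b) + b - a for a, b in zip(u, v))
d_kl_reversed = bregman(neg_entropy, grad_neg_entropy, v, u)
print(d_euclid, d_kl, kl_closed_form, d_kl_reversed)
```

Comparing `d_kl` with `d_kl_reversed` shows the asymmetry of the divergence for this choice of $F$.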


The following result, whose proof is left as an easy exercise, shows a basic relationship 
between the divergences of three arbitrary points. This relationship is used several times in 
subsequent sections. 


Lemma 11.1. Let $F : A \to \mathbb{R}$ be Legendre. Then, for all $u \in A$ and all $v, w \in \mathrm{int}(A)$,

\[
D_F(u, v) + D_F(v, w) = D_F(u, w) + (u - v) \cdot \bigl(\nabla F(w) - \nabla F(v)\bigr).
\]
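
The identity of Lemma 11.1 is easy to confirm numerically for a particular Legendre function. The sketch below uses the unnormalized Kullback-Leibler divergence of Example 11.2; the three test points are arbitrary positive vectors chosen for illustration.

```python
import math

def D(u, v):
    """Unnormalized Kullback-Leibler divergence (Example 11.2)."""
    return sum(a * math.log(a / b) + b - a for a, b in zip(u, v))

# Gradient of the unnormalized negative entropy F(p) = sum p_i ln p_i - sum p_i.
grad = lambda p: [math.log(pi) for pi in p]

u, v, w = [0.3, 1.2, 0.5], [0.8, 0.4, 0.6], [1.1, 0.9, 0.2]
lhs = D(u, v) + D(v, w)
rhs = D(u, w) + sum((ui - vi) * (gw - gv)
                    for ui, vi, gw, gv in zip(u, v, grad(w), grad(v)))
print(lhs, rhs)
```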


We now investigate the properties of projections based on Bregman divergences. Let $F :
A \to \mathbb{R}$ be a Legendre function and let $S \subseteq \mathbb{R}^d$ be a closed convex set with $S \cap A \ne \emptyset$.
The Bregman projection of $w \in \mathrm{int}(A)$ onto $S$ is

\[
\operatorname*{argmin}_{u \in S \cap A} D_F(u, w).
\]

The following lemma, whose proof is based on standard calculus, ensures existence and 
uniqueness of projections. 


Lemma 11.2. For all Legendre functions $F : A \to \mathbb{R}$, for all closed convex sets $S \subseteq \mathbb{R}^d$
such that $A \cap S \ne \emptyset$, and for all $w \in \mathrm{int}(A)$, the Bregman projection of $w$ onto $S$ exists
and is unique.


With respect to projections, all Bregman divergences behave similarly to the squared
euclidean distance, as shown by the next result (see Figure 11.1).

Lemma 11.3 (Generalized pythagorean inequality). Let $F$ be a Legendre function.
For all $w \in \mathrm{int}(A)$ and for all convex and closed sets $S \subseteq \mathbb{R}^d$ with $S \cap A \ne \emptyset$, if
$w' = \operatorname{argmin}_{v \in S \cap A} D_F(v, w)$, then

\[
D_F(u, w) \ge D_F(u, w') + D_F(w', w) \qquad \text{for all } u \in S.
\]

Proof. Define the function $G(x) = D_F(x, w) - D_F(x, w')$. Expanding the divergences
and simplifying, we note that

\[
G(x) = -F(w) - (x - w) \cdot \nabla F(w) + F(w') + (x - w') \cdot \nabla F(w').
\]

Thus $G$ is linear. Let $x_\alpha = \alpha u + (1 - \alpha) w'$ be an arbitrary point on the line joining $u$ and
$w'$. By linearity, $G(x_\alpha) = \alpha G(u) + (1 - \alpha) G(w')$ and thus

\[
D_F(x_\alpha, w) - D_F(x_\alpha, w') = \alpha \bigl(D_F(u, w) - D_F(u, w')\bigr) + (1 - \alpha) D_F(w', w).
\]

For $\alpha > 0$, this leads to

\[
D_F(u, w) - D_F(u, w') - D_F(w', w)
= \frac{D_F(x_\alpha, w) - D_F(x_\alpha, w') - D_F(w', w)}{\alpha}
\ge \frac{-D_F(x_\alpha, w')}{\alpha},
\]

where we used $D_F(x_\alpha, w) \ge D_F(w', w)$. This last inequality is true since $w'$ is the point in
$S$ with smallest divergence to $w$ and $x_\alpha \in S$ since $u \in S$ and $S$ is convex by hypothesis.
Let $D(x) = D_F(x, w')$. Because the left-hand side above does not depend on $\alpha$, to prove the
lemma it is then enough to prove that $D(x_\alpha)/\alpha \to 0$ as $\alpha \to 0^+$. Indeed,

\[
\lim_{\alpha \to 0^+} \frac{D(x_\alpha)}{\alpha} = \lim_{\alpha \to 0^+} \frac{D(w' + \alpha(u - w')) - D(w')}{\alpha}.
\]

The last limit is the directional derivative $D'_{u - w'}(w')$ of $D$ in the direction $u - w'$ evaluated at
$w'$ (the directional derivative exists in $\mathrm{int}(A)$ because of the second condition in the definition
of Legendre functions; moreover, if $w \in \mathrm{int}(A)$, then the third condition guarantees that $w'$
does not belong to the boundary of $A$). Now, exploiting a well-known relationship between
the directional derivative of a function and its gradient, we find that

\[
D'_{u - w'}(w') = (u - w') \cdot \nabla D(w').
\]

Since $D$ is differentiable and nonnegative and $D(w') = 0$, we have $\nabla D(w') = 0$. This
completes the proof. ∎


Note that Lemma 11.3 holds with equality whenever S is a hyperplane (see Exer- 
cise 11.2). 
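
Both the inequality and the equality case can be observed in small examples. The sketch below relies on two closed-form projections that are assumptions of this illustration rather than statements from the text: the euclidean projection onto the box $[0, 1]^3$ is coordinatewise clipping, and the projection onto the probability simplex under the unnormalized Kullback-Leibler divergence is normalization (this follows from a Lagrange multiplier computation).

```python
import math

def sq_dist(u, v):
    """Bregman divergence of F(x) = ||x||^2 / 2 (Example 11.1)."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(u, v))

def kl(u, v):
    """Unnormalized Kullback-Leibler divergence (Example 11.2)."""
    return sum(a * math.log(a / b) + b - a for a, b in zip(u, v))

# Euclidean case: S is the box [0, 1]^3, the projection clips each coordinate.
w = [1.7, -0.4, 0.6]
w_box = [min(1.0, max(0.0, wi)) for wi in w]
u_box = [0.2, 0.9, 0.1]                          # an arbitrary point of S
gap_box = sq_dist(u_box, w) - sq_dist(u_box, w_box) - sq_dist(w_box, w)

# KL case: S is the probability simplex, which lies in the hyperplane sum u_i = 1.
w_pos = [0.5, 2.0, 1.5]
total = sum(w_pos)
w_simplex = [wi / total for wi in w_pos]
u_simplex = [0.1, 0.6, 0.3]                      # an arbitrary point of the simplex
gap_simplex = kl(u_simplex, w_pos) - kl(u_simplex, w_simplex) - kl(w_simplex, w_pos)

print(gap_box, gap_simplex)
```

The first gap is strictly positive, while the second vanishes up to rounding, in line with the remark on hyperplanes.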

To derive some additional key properties of Bregman divergences, we need a few basic
notions about convex duality. Let $F : A \to \mathbb{R}$ be Legendre. Then its Legendre dual (or
Legendre conjugate) is the function $F^*$ defined by

\[
F^*(u) = \sup_{v \in A} \bigl(u \cdot v - F(v)\bigr).
\]

The conditions defining Legendre functions guarantee that whenever $F$ is Legendre,
then $F^* : A^* \to \mathbb{R}$ is also Legendre and such that $A^*$ is the range of the mapping
$\nabla F : \mathrm{int}(A) \to \mathbb{R}^d$. Moreover, the Legendre dual $F^{**}$ of $F^*$ equals $F$ (see, e.g.,
Section 26 in Rockafellar [247]). The following simple identity relates a pair of Legendre
duals.


Lemma 11.4. For all Legendre functions $F$, $F(u) + F^*(u') = u \cdot u'$ if and only if $u' =
\nabla F(u)$.


Proof. By definition of Legendre duality, $F^*(u')$ is the supremum of the concave function
$G(x) = u' \cdot x - F(x)$. If this supremum is attained at $u \in \mathbb{R}^d$, then $\nabla G(u) = 0$, which is
to say $u' = \nabla F(u)$. On the other hand, if $u' = \nabla F(u)$, then $u$ is a maximizer of $G(x)$, and
therefore $F^*(u') = u \cdot u' - F(u)$. ∎


The following lemma shows that gradients of a pair of dual Legendre functions are 
inverses of each other. 


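Lemma 11.5. For all Legendre functions $F$, $\nabla F^* = (\nabla F)^{-1}$.


Proof. Using Lemma 11.4 twice,

\[
\begin{aligned}
u' = \nabla F(u) \quad &\text{if and only if} \quad F(u) + F^*(u') = u \cdot u', \\
u = \nabla F^*(u') \quad &\text{if and only if} \quad F^*(u') + F^{**}(u) = u \cdot u'.
\end{aligned}
\]

Because $F^{**} = F$, the lemma is proved. ∎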


Example 11.3. The Legendre dual of the half of the squared $p$-norm $\frac{1}{2}\|u\|_p^2$, $p \ge 2$, is
the half of the squared $q$-norm $\frac{1}{2}\|u\|_q^2$, where $p$ and $q$ are conjugate exponents; that is,
$1/p + 1/q = 1$. The euclidean norm $\frac{1}{2}\|u\|_2^2$ is the only self-dual norm (it is the dual of
itself). The squared $p$-norms are Legendre, and therefore the gradients of their duals are
inverses of each other,

\[
\Bigl(\nabla \tfrac{1}{2}\|u\|_p^2\Bigr)_i = \frac{\mathrm{sgn}(u_i)\,|u_i|^{p-1}}{\|u\|_p^{p-2}}
\qquad \text{and} \qquad
\Bigl(\nabla \tfrac{1}{2}\|u\|_p^2\Bigr)^{-1} = \nabla \tfrac{1}{2}\|u\|_q^2,
\]

where, as before, $1/p + 1/q = 1$.
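
The inverse relationship between the two gradients can be verified on a concrete vector. The function name `grad_half_sq_norm`, the choice $p = 3$, and the test vector are illustrative assumptions.

```python
def grad_half_sq_norm(u, p):
    """Gradient of (1/2) ||u||_p^2: sgn(u_i) |u_i|^(p-1) / ||u||_p^(p-2)."""
    norm = sum(abs(ui) ** p for ui in u) ** (1.0 / p)
    sgn = lambda t: (t > 0) - (t < 0)
    return [sgn(ui) * abs(ui) ** (p - 1) / norm ** (p - 2) for ui in u]

p = 3.0
q = p / (p - 1.0)                 # conjugate exponent: 1/p + 1/q = 1
u = [0.5, -1.5, 2.0]
v = grad_half_sq_norm(u, p)       # map u to the dual space
u_back = grad_half_sq_norm(v, q)  # the gradient of the dual maps it back
print(u, u_back)
```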


Example 11.4. The function $F(u) = e^{u_1} + \cdots + e^{u_d}$ has gradient $\nabla F(u)_i = e^{u_i}$ whose
inverse is $\nabla F^*(v)_i = \ln v_i$, $v_i > 0$. Hence, the Legendre dual of $F$ is

\[
F^*(v) = \sum_{i=1}^d v_i (\ln v_i - 1).
\]

Note that if $v$ lies on the probability simplex in $\mathbb{R}^d$ and $H(v)$ is the entropy of $v$, then
$F^*(v) = -(H(v) + 1)$.


Example 11.5. The hyperbolic cosine potential

\[
F(u) = \frac{1}{2} \sum_{i=1}^d \bigl(e^{u_i} + e^{-u_i}\bigr)
\]

has gradient with components equal to the hyperbolic sine $\nabla F(u)_i = \sinh(u_i) = \frac{1}{2}(e^{u_i} -
e^{-u_i})$. Therefore, the inverse gradient is the hyperbolic arcsine, $\nabla F^*(v)_i = \mathrm{arcsinh}(v_i) =
\ln\bigl(\sqrt{v_i^2 + 1} + v_i\bigr)$, whose integral gives us the dual of $F$:

\[
F^*(v) = \sum_{i=1}^d \Bigl(v_i\, \mathrm{arcsinh}(v_i) - \sqrt{v_i^2 + 1}\Bigr).
\]

We close this section by mentioning an additional property that relates a divergence based 
on F to that based on its Legendre dual F*. 


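Proposition 11.1. Let $F : \mathcal{A} \to \mathbb{R}$ be a Legendre function. For all $u, v \in \mathrm{int}(\mathcal{A})$, if $u' = \nabla F(u)$ and $v' = \nabla F(v)$, then $D_F(u, v) = D_{F^*}(v', u')$.

Proof. We have

$$\begin{aligned}
D_F(u, v) &= F(u) - F(v) - (u - v)\cdot\nabla F(v) \\
&= F(u) - F(v) - (u - v)\cdot v' \\
&= u'\cdot u - F^*(u') - v'\cdot v + F^*(v') - (u - v)\cdot v' \qquad\text{(using Lemma 11.4)} \\
&= u'\cdot u - F^*(u') + F^*(v') - u\cdot v' \\
&= F^*(v') - F^*(u') - (v' - u')\cdot u \\
&= F^*(v') - F^*(u') - (v' - u')\cdot\nabla F^*(u') \qquad\text{(using } u' = \nabla F(u) \text{ and Lemma 11.5)} \\
&= D_{F^*}(v', u').
\end{aligned}$$

This concludes the proof. □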
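As a sanity check of Proposition 11.1, the sketch below evaluates both divergences for the Legendre pair of Example 11.4. It is an illustration under assumed names (`bregman`, `grad_F`, `grad_F_star`) and arbitrarily chosen points `u` and `v`, not the book's code.

```python
import math

def bregman(F, grad_F, u, v):
    """Bregman divergence D_F(u, v) = F(u) - F(v) - (u - v) . grad F(v)."""
    g = grad_F(v)
    return F(u) - F(v) - sum((ui - vi) * gi for ui, vi, gi in zip(u, v, g))

# Legendre pair of Example 11.4: F(u) = sum exp(u_i), F*(w) = sum w_i (ln w_i - 1).
def F(u):
    return sum(math.exp(x) for x in u)

def grad_F(u):
    return [math.exp(x) for x in u]

def F_star(w):
    return sum(x * (math.log(x) - 1.0) for x in w)

def grad_F_star(w):
    return [math.log(x) for x in w]

u = [0.3, -1.2, 0.7]
v = [-0.5, 0.4, 1.1]
u_prime, v_prime = grad_F(u), grad_F(v)

lhs = bregman(F, grad_F, u, v)                        # D_F(u, v)
rhs = bregman(F_star, grad_F_star, v_prime, u_prime)  # D_{F*}(v', u')
print(lhs, rhs)
```

Note the swapped order of the arguments on the dual side: the divergence from $u$ to $v$ under $F$ equals the divergence from $v'$ to $u'$ under $F^*$.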


11.3 Potential-Based Gradient Descent 


Now we return to the main topic of this chapter introduced in Section 11.1, that is, to the design and analysis of forecasters that use the side-information vector $x_t$ and compete with linear experts, or, in other words, with reference forecasters whose prediction takes the form $f_{u,t} = u \cdot x_t$ for some fixed vector $u \in \mathbb{R}^d$. In this and the four subsequent sections we focus our attention on linear forecasters whose prediction, at time $t$, takes the form $\hat p_t = w_{t-1} \cdot x_t$. The weight vector $w_t$ used in the next round of prediction is determined as a function of the current weight $w_{t-1}$, the side information $x_t$, and the outcome $y_t$. The forecasters studied in these sections differ in the way the weight vectors are updated. We start by describing a family of forecasters that update their weights by performing a kind of gradient descent based on an appropriately defined potential function.

To motivate the gradient descent forecasters defined below, we compare them first to the potential-based forecasters introduced in Chapter 2 in a different setup. Recall that in Chapter 2 we define weighted average forecasters using potential functions in order to control the dependence of the weights on the regret. More precisely, we define weights at time $t$ by $w_{t-1} = \nabla\Phi(R_{t-1})$, where $R_{t-1}$ is the cumulative regret up to time $t - 1$. The convexity of the loss function, through which the regret is defined, entails (via Lemma 2.1) an invariant that we call the Blackwell condition: $w_{t-1} \cdot r_t \le 0$. This invariant, together with Taylor's theorem, is the main tool used in Theorem 2.1.

If $\Phi$ is a Legendre function, then Lemma 11.5 provides the dual relations $w_t = \nabla\Phi(R_t)$ and $R_t = \nabla\Phi^*(w_t)$. As $\Phi$ and $\Phi^*$ are Legendre duals, we may call primal weights the regrets $R_t$ and dual weights the weights $w_t$, where $\nabla\Phi$ maps the primal weights to the dual weights and $\nabla\Phi^*$ performs the inverse mapping. Introducing the notation $\theta_t = R_t$ to stress the fact that we now view regrets as parameters, we see that $\theta_t$ satisfies the recursion

$$\theta_t = \theta_{t-1} + r_t \qquad\text{(primal regret update)}.$$

Via the identities $\theta_t = R_t = \nabla\Phi^*(w_t)$, the primal regret update can also be rewritten in the equivalent dual form

$$\nabla\Phi^*(w_t) = \nabla\Phi^*(w_{t-1}) + r_t \qquad\text{(dual regret update)}.$$


To appreciate the power of this dual interpretation for the regret-based update, consider the following argument. The direct application of the potential-based forecaster to the class of linear experts requires a quantization (discretization) of the linear coefficient domain $\mathbb{R}^d$ in order to obtain a finite approximation of the set of experts. Performing this quantization in, say, a bounded region $[-W, W]^d$ of $\mathbb{R}^d$ results in a number of experts of the order of $(W/\varepsilon)^d$, where $\varepsilon$ is the quantization scale. This inconvenient exponential dependence on the dimension $d$ can be avoided altogether by replacing the regret minimization approach of Chapter 2 with a different loss minimization method. This method, which we call sequential gradient descent, is applicable to linear forecasters generating predictions of the form $\hat p_t = \theta_{t-1} \cdot x_t$ and uses the weight update rule

$$\theta_t = \theta_{t-1} - \lambda\nabla\ell_t(\theta_{t-1}) \qquad\text{(primal gradient update)},$$

where $\theta_t \in \mathbb{R}^d$, $\lambda > 0$ is an arbitrary scaling factor, and we set $\ell_t(\theta_{t-1}) = \ell(\theta_{t-1} \cdot x_t, y_t)$. With this method we replace regret minimization taking place in $\mathbb{R}^N$ (where $N = \mathbb{R}^d$ in this case) with gradient minimization taking place in $\mathbb{R}^d$. Note also that, due to the convexity of the loss functions $\ell(\cdot, y)$, minimizing the gradient implies minimizing the loss.

In full analogy with the regret minimization approach, we may now introduce a potential $\Phi$, the associated dual weights $w_t = \nabla\Phi(\theta_t)$, and the forecaster

$$\hat p_t = w_{t-1} \cdot x_t \qquad\text{(gradient-based linear forecaster)}$$

whose weights $w_{t-1}$ are updated using the rule

$$\nabla\Phi^*(w_t) = \nabla\Phi^*(w_{t-1}) - \lambda\nabla\ell_t(w_{t-1}) \qquad\text{(dual gradient update)}.$$

By rewriting the dual gradient update as

$$\theta_t = \theta_{t-1} - \lambda\nabla\ell_t(w_{t-1})$$

we see that this update corresponds to performing a gradient descent step on the weights $\theta_t$, which are the image of $w_t$ according to the bijection $\nabla\Phi^*$, using $\nabla\ell_t(w_{t-1})$ rather than $\nabla\ell_t(\theta_{t-1})$ (see Figure 11.2).




Figure 11.2. An illustration of the dual gradient update. A weight $w_{t-1} \in \mathbb{R}^2$ is updated as follows: first, $w_{t-1}$ is mapped to the corresponding primal weight $\theta_{t-1}$ via the bijection $\nabla\Phi^*$. Then a gradient descent step is taken obtaining the updated primal weight $\theta_t$. Finally, $\theta_t$ is mapped to the dual weight $w_t$ via the inverse mapping $\nabla\Phi$. The curve on the left-hand side shows a surface of constant loss. The curve on the right-hand side is the image of this surface according to the mapping $\nabla\Phi$, where $\Phi$ is the polynomial potential with degree $p = 2.6$.


Since $\nabla\Phi$ is the inverse of $\nabla\Phi^*$, it is easy to express explicitly the update in terms of $w_t$:

$$w_t = \nabla\Phi\bigl(\nabla\Phi^*(w_{t-1}) - \lambda\nabla\ell_t(w_{t-1})\bigr).$$
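The explicit update can be sketched for the squared $p$-norm potential of Example 11.3, whose gradient and inverse gradient are available in closed form. The code below is an illustrative sketch only: the helper names (`p_norm`, `grad_half_sq_norm`, `dual_gradient_step`) are assumptions, and the gradient of the loss is passed in as a plain list rather than computed from data.

```python
import math

def p_norm(u, p):
    """The p-norm of a vector given as a list of floats."""
    return sum(abs(x) ** p for x in u) ** (1.0 / p)

def grad_half_sq_norm(u, p):
    """Gradient of (1/2)||u||_p^2: sgn(u_i)|u_i|^(p-1) / ||u||_p^(p-2); zero at the origin."""
    norm = p_norm(u, p)
    if norm == 0.0:
        return [0.0] * len(u)
    return [math.copysign(abs(x) ** (p - 1), x) / norm ** (p - 2) for x in u]

def dual_gradient_step(w, grad_loss, lam, p):
    """One step w_t = grad Phi(grad Phi*(w_{t-1}) - lam * grad_loss) with Phi = (1/2)||.||_p^2."""
    q = p / (p - 1.0)                                          # conjugate exponent
    theta = grad_half_sq_norm(w, q)                            # dual weight -> primal weight
    theta = [th - lam * g for th, g in zip(theta, grad_loss)]  # gradient descent step
    return grad_half_sq_norm(theta, p)                         # primal weight -> dual weight

w = [0.5, -1.0, 2.0]
round_trip = grad_half_sq_norm(grad_half_sq_norm(w, 1.5), 3.0)  # p = 3, q = 3/2
print(round_trip)
print(dual_gradient_step([1.0, 2.0], [0.5, -1.0], 0.1, 2.0))
```

The round trip through the two gradient maps should return `w` up to rounding, and for $p = 2$ both maps are the identity, so the step reduces to ordinary gradient descent.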


A different intuition on the gradient-based linear forecaster is gained by observing that $w_t$ may be viewed as an approximate solution to

$$\min_{u \in \mathbb{R}^d}\bigl[D_{\Phi^*}(u, w_{t-1}) + \lambda\,\ell_t(u)\bigr].$$

In other words, $w_t$ expresses a tradeoff between the distance from the old weight $w_{t-1}$ (measured by the Bregman divergence induced by the dual potential $\Phi^*$) and the loss suffered by $w_t$ if the last observed pair $(x_t, y_t)$ appeared again at the next time step. To conclude the discussion on $w_t$, note that $w_t$ is in fact characterized as a solution to the following convex minimization problem:

$$\min_{u \in \mathbb{R}^d}\Bigl[D_{\Phi^*}(u, w_{t-1}) + \lambda\bigl(\ell_t(w_{t-1}) + (u - w_{t-1})\cdot\nabla\ell_t(w_{t-1})\bigr)\Bigr].$$

This second minimization problem is an approximated version of the first one because the term $\ell_t(w_{t-1}) + (u - w_{t-1})\cdot\nabla\ell_t(w_{t-1})$ is the first-order Taylor approximation of $\ell_t(u)$ around $w_{t-1}$.

We call regular loss function any convex, differentiable, nonnegative function $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ such that, for any fixed $x_t \in \mathbb{R}^d$ and $y_t \in \mathbb{R}$, the function $\ell_t(w) = \ell(w \cdot x_t, y_t)$ is differentiable. The next result shows a general bound on the regret of the gradient-based linear forecaster.


Theorem 11.1. Let $\ell$ be a regular loss function. If the gradient-based linear forecaster is run with a Legendre potential $\Phi$, then, for all $u \in \mathbb{R}^d$, the regret $R_n(u) = \hat L_n - L_n(u)$ satisfies

$$R_n(u) \le \frac{1}{\lambda} D_{\Phi^*}(u, w_0) + \frac{1}{\lambda}\sum_{t=1}^n D_{\Phi^*}(w_{t-1}, w_t).$$


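Proof. Consider any linear forecaster using fixed weights $u \in \mathbb{R}^d$. Then

$$\begin{aligned}
\ell_t(w_{t-1}) &\le \ell_t(u) - (u - w_{t-1})\cdot\nabla\ell_t(w_{t-1}) \qquad\text{(by Taylor's theorem and using convexity of } \ell\text{)} \\
&= \ell_t(u) + \frac{1}{\lambda}(u - w_{t-1})\cdot\bigl(\nabla\Phi^*(w_t) - \nabla\Phi^*(w_{t-1})\bigr) \qquad\text{(by definition of the dual gradient update)} \\
&= \ell_t(u) + \frac{1}{\lambda}\bigl(D_{\Phi^*}(u, w_{t-1}) - D_{\Phi^*}(u, w_t) + D_{\Phi^*}(w_{t-1}, w_t)\bigr) \qquad\text{(by Lemma 11.1)}.
\end{aligned}$$

Summing over $t$ and using the nonnegativity of Bregman divergences to drop the term $-D_{\Phi^*}(u, w_n)$ completes the proof. □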


At first glance, the bound of Theorem 11.1 may give the impression that the larger the $\lambda$, the smaller the regret is. However, a large value of $\lambda$ may make the weight vectors $w_t$ change rapidly with time, causing an increase in the divergences in the second term. In the concrete examples that follow, we will see that the learning rate $\lambda$ needs to be tuned carefully to obtain optimal performance.

It is interesting to compare the bound of Theorem 11.1 (for $\lambda = 1$) with the bound of Theorem 2.1. Rewriting both bounds in primal weight form, we obtain, setting $w_0 = \nabla\Phi(0)$,

$$\Phi(R_n) \le \Phi(0) + \sum_{t=1}^n D_\Phi(\theta_t, \theta_{t-1})$$

$$R_n(u) \le D_\Phi\bigl(0, \nabla\Phi^*(u)\bigr) + \sum_{t=1}^n D_\Phi(\theta_t, \theta_{t-1}),$$

where in the first bound the $\theta_t$ are updated using the primal regret update (as in Theorem 2.1) and in the second bound the $\theta_t$ are updated using the primal gradient update (as in Theorem 11.1). Note that, in both cases, the terms in the sum on the right-hand side are the divergences from the old primal weight to the new primal weight. However, whereas the first bound applies to a set of $N$ arbitrary experts, the second bound applies to the set of all (continuously many) linear experts.


11.4 The Transfer Function 


To add flexibility to the gradient-based linear forecaster, and to make its analysis easier, we constrain its predictions $\hat p_t = w_{t-1} \cdot x_t$ by introducing a differentiable and nondecreasing transfer function $\sigma : \mathbb{R} \to \mathbb{R}$ and letting $\hat p_t = \sigma(w_{t-1} \cdot x_t)$. Similarly, we redefine the predictions of expert $u$ as $\sigma(u \cdot x_t)$ and let $\ell_t^\sigma(w_{t-1}) = \ell(\sigma(w_{t-1} \cdot x_t), y_t)$. The forecaster predicting with a transfer function, which we simply call gradient-based forecaster, is sketched in the following:


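THE GRADIENT-BASED FORECASTER

Parameters: learning rate $\lambda > 0$, transfer function $\sigma$.

Initialization: $w_0 = \nabla\Phi(0)$.

For each round $t = 1, 2, \ldots$

(1) observe $x_t$ and predict $\hat p_t = \sigma(w_{t-1} \cdot x_t)$;
(2) get $y_t \in \mathbb{R}$ and incur loss $\ell_t^\sigma(w_{t-1}) = \ell(\hat p_t, y_t)$;
(3) let $w_t = \nabla\Phi\bigl(\nabla\Phi^*(w_{t-1}) - \lambda\nabla\ell_t^\sigma(w_{t-1})\bigr)$.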


The regret of the gradient-based forecaster with respect to a linear expert $u$ takes the form

$$R_n^\sigma(u) = \sum_{t=1}^n \bigl(\ell_t^\sigma(w_{t-1}) - \ell_t^\sigma(u)\bigr).$$

Note that, for fairness, the loss of the forecaster $\hat p_t = \sigma(w_{t-1} \cdot x_t)$ is compared with the loss of the "transferred" linear expert $\sigma(u \cdot x_t)$. Instances of the gradient-based forecaster obtained by considering specific potentials correspond to well-known pattern recognition algorithms. For example, the Widrow–Hoff rule [310] $w_t = w_{t-1} - \lambda(w_{t-1} \cdot x_t - y_t)\,x_t$ is equivalent to the gradient-based forecaster using the quadratic potential (i.e., the polynomial potential with $p = 2$), the square loss, and the identity transfer function $\sigma(p) = p$. The EG algorithm of Kivinen and Warmuth [181] corresponds to the forecaster using the exponential potential. Regret bounds for these concrete potentials are derived later in this section.
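The three steps of the gradient-based forecaster can be sketched as follows for the squared $p$-norm potential. This is a minimal illustration, not the authors' implementation: the function names, the callback `dloss_dv` (the derivative of $\ell(\sigma(v), y)$ with respect to $v$), and the toy data are all assumptions. With $p = 2$, the identity transfer function, and the square loss $\frac12(p - y)^2$, the sketch should reproduce the Widrow–Hoff iterates.

```python
import math

def p_norm(u, p):
    return sum(abs(x) ** p for x in u) ** (1.0 / p)

def grad_half_sq_norm(u, p):
    """Gradient of (1/2)||u||_p^2 (Example 11.3); zero at the origin."""
    norm = p_norm(u, p)
    if norm == 0.0:
        return [0.0] * len(u)
    return [math.copysign(abs(x) ** (p - 1), x) / norm ** (p - 2) for x in u]

def gradient_based_forecaster(data, lam, p, sigma, dloss_dv):
    """Run steps (1)-(3) on a list of (x, y) pairs; returns final weights and predictions."""
    q = p / (p - 1.0)
    w = [0.0] * len(data[0][0])                   # w_0 = grad Phi_p(0) = 0
    predictions = []
    for x, y in data:
        v = sum(wi * xi for wi, xi in zip(w, x))
        predictions.append(sigma(v))              # step (1): predict
        slope = dloss_dv(v, y)                    # step (2): loss derivative at v
        grad = [slope * xi for xi in x]           # chain rule: gradient w.r.t. w
        theta = grad_half_sq_norm(w, q)           # step (3): map, descend, map back
        theta = [th - lam * g for th, g in zip(theta, grad)]
        w = grad_half_sq_norm(theta, p)
    return w, predictions

def widrow_hoff(data, lam):
    """Reference rule w_t = w_{t-1} - lam * (w_{t-1} . x_t - y_t) * x_t."""
    w = [0.0] * len(data[0][0])
    for x, y in data:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        w = [wi - lam * err * xi for wi, xi in zip(w, x)]
    return w

data = [([1.0, 0.5], 1.0), ([-0.3, 0.8], 0.2), ([0.6, -0.9], -0.5)]
w_gb, _ = gradient_based_forecaster(data, 0.1, 2.0, lambda v: v, lambda v, y: v - y)
w_wh = widrow_hoff(data, 0.1)
print(w_gb, w_wh)
```

Other choices of `p`, `sigma`, and `dloss_dv` give other members of the family, for instance the hyperbolic tangent with the entropic loss of Example 11.6.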

To apply Theorem 11.1 with transfer functions, we have to make sure that the loss $\ell(\sigma(w \cdot x), y)$ is a convex function of $w$ for all $y$. Because $w \cdot x$ is linear, it suffices to guarantee that $\ell(\sigma(v), y)$ is convex in $v$. We call nice pair any pair $(\sigma, \ell)$ such that $\ell(\sigma(\cdot), y)$ is convex for all fixed $y$. Trivially, any regular loss function forms a nice pair with the identity transfer function. Less trivial examples are as follows.


Example 11.6. The hyperbolic tangent $\sigma(v) = (e^v - e^{-v})/(e^v + e^{-v}) \in [-1, 1]$ and the entropic loss

$$\ell(p, y) = \frac{1 + y}{2}\ln\frac{1 + y}{1 + p} + \frac{1 - y}{2}\ln\frac{1 - y}{1 - p}$$

form a nice pair.


Example 11.7. The logistic transfer function $\sigma(v) = (1 + e^{-v})^{-1} \in [0, 1]$ and the Hellinger loss $\ell(p, y) = \bigl(\sqrt{p} - \sqrt{y}\bigr)^2 + \bigl(\sqrt{1 - p} - \sqrt{1 - y}\bigr)^2$ form a nice pair.


To avoid imposing artificial conditions on the sequence of outcomes and side information, we focus on nice pairs $(\sigma, \ell)$ satisfying the additional condition

$$\left(\frac{d\ell(\sigma(v), y)}{dv}\right)^2 \le \alpha\,\ell(\sigma(v), y) \qquad\text{for some } \alpha > 0.$$

We call such pairs $\alpha$-subquadratic. The nice pair of Example 11.6 is 2-subquadratic and the nice pair of Example 11.7 is 1/4-subquadratic. The nice pair formed by the identity transfer function and the square loss $\ell(p, y) = \frac12(p - y)^2$ is 1-subquadratic. (To analyze the gradient-based forecaster, it is more convenient to work with this definition of square loss equal to half of the square loss used in previous chapters.) We now illustrate the regret bounds that we can obtain through the use of $\alpha$-subquadratic nice pairs. Let $R_n^\sigma(u) = \hat L_n^\sigma - L_n^\sigma(u)$, where

$$\hat L_n^\sigma = \sum_{t=1}^n \ell\bigl(\sigma(w_{t-1} \cdot x_t), y_t\bigr) \qquad\text{and}\qquad L_n^\sigma(u) = \sum_{t=1}^n \ell\bigl(\sigma(u \cdot x_t), y_t\bigr).$$


Polynomial Potential 

The polynomial potential $\|u_+\|_p^2$ defined in Chapter 2 is not Legendre because, owing to the $(\cdot)_+$ operator, it is not strictly convex outside of the positive orthant. To make this potential Legendre, we may redefine it as $\Phi_p(u) = \frac12\|u\|_p^2$, where we also introduced an additional scaling factor of 1/2 so as to have $\Phi_p^*(u) = \Phi_q(u) = \frac12\|u\|_q^2$ (see Example 11.3). It is easy to check that all the regret bounds we proved in Chapter 2 for the polynomial potential still hold for the Legendre polynomial potential.


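Theorem 11.2. For any $\alpha$-subquadratic nice pair $(\sigma, \ell)$, if the gradient-based forecaster using the Legendre polynomial potential $\Phi_p$ is run with learning rate $\lambda = 2\varepsilon/((p - 1)\,\alpha X_p^2)$ on a sequence $(x_1, y_1), (x_2, y_2), \ldots \in \mathbb{R}^d \times \mathbb{R}$, where $0 < \varepsilon < 1$, then for all $u \in \mathbb{R}^d$ and for all $n \ge 1$ such that $\max_{t=1,\ldots,n}\|x_t\|_p \le X_p$,

$$\hat L_n^\sigma \le \frac{L_n^\sigma(u)}{1 - \varepsilon} + \frac{\|u\|_q^2}{\varepsilon(1 - \varepsilon)} \times \frac{(p - 1)\,\alpha X_p^2}{4},$$

where $q$ is the conjugate exponent of $p$.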


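Proof. We apply Theorem 11.1 to the Legendre potential $\Phi_p(u)$ and to the convex loss $\ell_t^\sigma(\cdot)$ and obtain

$$R_n^\sigma(u) \le \frac{1}{\lambda} D_{\Phi_q}(u, w_0) + \frac{1}{\lambda}\sum_{t=1}^n D_{\Phi_q}(w_{t-1}, w_t).$$

Since the initial primal weight $\theta_0$ is $0$, $w_0 = \nabla\Phi_p(0) = 0$ and $\Phi_q(w_0) = 0$. This implies $D_{\Phi_q}(u, w_0) = \Phi_q(u)$. As for the other terms, we simply observe that, by Proposition 11.1, $D_{\Phi_q}(w_{t-1}, w_t) = D_{\Phi_p}(\theta_t, \theta_{t-1})$, where $\theta_t = \nabla\Phi_q(w_t)$ and $\theta_t = \theta_{t-1} - \lambda\nabla\ell_t^\sigma(w_{t-1})$. We can then adapt the proof of Corollary 2.1 replacing $N$ by $d$ and $r_t$ by $-\lambda\nabla\ell_t^\sigma(w_{t-1})$ and adjusting for the scaling factor 1/2. This yields

$$\begin{aligned}
R_n^\sigma(u) &\le \frac{\Phi_q(u)}{\lambda} + \frac{1}{\lambda}\sum_{t=1}^n D_{\Phi_q}(w_{t-1}, w_t) \\
&\le \frac{\Phi_q(u)}{\lambda} + (p - 1)\frac{\lambda}{2}\sum_{t=1}^n \bigl\|\nabla\ell_t^\sigma(w_{t-1})\bigr\|_p^2 \\
&\le \frac{\Phi_q(u)}{\lambda} + (p - 1)\frac{\lambda}{2}\sum_{t=1}^n \left(\left.\frac{d\ell(\sigma(v_t), y_t)}{dv_t}\right|_{v_t = w_{t-1}\cdot x_t}\right)^2 \|x_t\|_p^2 \\
&\le \frac{\Phi_q(u)}{\lambda} + (p - 1)\frac{\alpha\lambda}{2}\sum_{t=1}^n \ell_t^\sigma(w_{t-1})\,\|x_t\|_p^2 \qquad\text{(as } (\sigma, \ell) \text{ is } \alpha\text{-subquadratic)} \\
&\le \frac{\Phi_q(u)}{\lambda} + (p - 1)\frac{\alpha\lambda}{2}\,X_p^2\,\hat L_n^\sigma.
\end{aligned}$$

Note that $R_n^\sigma(u) = \hat L_n^\sigma - L_n^\sigma(u)$. Rearranging terms, substituting our choice of $\lambda$, and using the equality $\Phi_q(u) = \frac12\|u\|_q^2$ yields the result. □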
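The bound of Theorem 11.2 can be checked empirically on synthetic data. The sketch below is an illustration under stated assumptions: it uses the 2-subquadratic nice pair of Example 11.6 (hyperbolic tangent, entropic loss), for which the derivative of $\ell(\sigma(v), y)$ with respect to $v$ works out to $\sigma(v) - y$; the data generator, the reference expert `u`, and all names are hypothetical choices made for the example.

```python
import math
import random

def p_norm(u, p):
    return sum(abs(x) ** p for x in u) ** (1.0 / p)

def grad_half_sq_norm(u, p):
    """Gradient of (1/2)||u||_p^2 (Example 11.3); zero at the origin."""
    norm = p_norm(u, p)
    if norm == 0.0:
        return [0.0] * len(u)
    return [math.copysign(abs(x) ** (p - 1), x) / norm ** (p - 2) for x in u]

def entropic_loss(p_hat, y):
    """Entropic loss of Example 11.6, for p_hat and y strictly inside (-1, 1)."""
    return (0.5 * (1 + y) * math.log((1 + y) / (1 + p_hat))
            + 0.5 * (1 - y) * math.log((1 - y) / (1 - p_hat)))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

random.seed(0)
d, n, p, eps, alpha = 5, 300, 3.0, 0.5, 2.0
q = p / (p - 1.0)
u = [0.5, -0.5, 0.0, 0.0, 0.0]                       # hypothetical reference expert
xs = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
ys = [math.tanh(dot(u, x)) + random.uniform(-0.2, 0.2) for x in xs]  # stays in (-1, 1)

X_p = max(p_norm(x, p) for x in xs)
lam = 2 * eps / ((p - 1) * alpha * X_p ** 2)         # learning rate of Theorem 11.2

theta = [0.0] * d                                    # primal weight, theta_0 = 0
w = grad_half_sq_norm(theta, p)                      # dual weight w_0 = 0
forecaster_loss = expert_loss = 0.0
for x, y in zip(xs, ys):
    p_hat = math.tanh(dot(w, x))
    forecaster_loss += entropic_loss(p_hat, y)
    expert_loss += entropic_loss(math.tanh(dot(u, x)), y)
    slope = p_hat - y                                # d/dv of loss(tanh(v), y) at v = w . x
    theta = [th - lam * slope * xi for th, xi in zip(theta, x)]
    w = grad_half_sq_norm(theta, p)

bound = (expert_loss / (1 - eps)
         + p_norm(u, q) ** 2 * (p - 1) * alpha * X_p ** 2 / (4 * eps * (1 - eps)))
print(forecaster_loss, bound)
```

The loop tracks the primal weight $\theta_t$ directly and maps it to $w_t$ with $\nabla\Phi_p$, which is equivalent to step (3) of the forecaster because $\nabla\Phi_q$ inverts $\nabla\Phi_p$.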


Choosing, say, $\varepsilon = 1/2$, the bound of Theorem 11.2 guarantees that the cumulative loss of the gradient-based forecaster is not larger than twice the loss of any linear expert plus a constant depending on the expert. If $\varepsilon$ is chosen to be a smaller constant, the factor of 2 in front of $L_n^\sigma(u)$ may be decreased to any value greater than 1, at the price of increasing the constant terms. One may be tempted to choose the value of the tuning parameter $\varepsilon$ so as to minimize the obtained upper bound. However, this tuned $\varepsilon$ would have to depend on the preliminary knowledge of $L_n^\sigma(u)$. Since the bound holds for all $u$, and $u$ is arbitrary, such optimization is not feasible. In Section 11.5 we introduce a so-called self-confident forecaster that dynamically tunes the value of the learning rate $\lambda$ to achieve a bound of the order $\sqrt{L_n^\sigma(u)}$ for all $u$ for which $\Phi_q(u)$ is bounded by a constant. Note that this bound behaves as if $\lambda$ had been optimized in the bound of Theorem 11.2 separately for each $u$ in the set considered.


Exponential Potential 
As for the polynomial potential, the exponential potential defined as

$$\frac{1}{\eta}\ln\sum_{i=1}^d e^{\eta u_i}$$

is not Legendre (in particular, the gradient is constant along the line $u_1 = u_2 = \cdots = u_d$, and therefore the potential is not strictly convex). We thus introduce the Legendre exponential potential $\Phi(u) = e^{u_1} + \cdots + e^{u_d}$ (the parameter $\eta$, used for the exponential potential in Chapter 2, is redundant here). Recalling Example 11.4, $\nabla\Phi^*(w)_i = \ln w_i$. For this potential, the dual gradient update $w_t = \nabla\Phi\bigl(\nabla\Phi^*(w_{t-1}) - \lambda\nabla\ell_t(w_{t-1})\bigr)$ can thus be written as

$$w_{i,t} = \exp\bigl(\ln w_{i,t-1} - \lambda\nabla\ell_t(w_{t-1})_i\bigr) = w_{i,t-1}\,e^{-\lambda\nabla\ell_t(w_{t-1})_i} \qquad\text{for } i = 1, \ldots, d.$$

Adding normalization, which we need for the analysis of the regret, results in the final weight update rule

$$w_{i,t} = \frac{w_{i,t-1}\,e^{-\lambda\nabla\ell_t(w_{t-1})_i}}{\sum_{j=1}^d w_{j,t-1}\,e^{-\lambda\nabla\ell_t(w_{t-1})_j}}.$$

This normalization corresponds to a Bregman projection of the original weight onto the probability simplex in $\mathbb{R}^d$, where the projection is taken according to the Legendre dual

$$\Phi^*(u) = \sum_{i=1}^d u_i(\ln u_i - 1)$$

(see Exercise 11.4). A thorough analysis of gradient-based forecasters with projected weights is carried out in Section 11.5.
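The normalized update can be written in a few lines. The sketch below is illustrative only; the function name `eg_update` and the toy inputs are assumptions, and the gradient vector is supplied directly instead of being derived from a loss.

```python
import math

def eg_update(w, grad, lam):
    """Multiplicative update w_i * exp(-lam * grad_i), followed by normalization."""
    unnormalized = [wi * math.exp(-lam * gi) for wi, gi in zip(w, grad)]
    total = sum(unnormalized)
    return [ui / total for ui in unnormalized]

# Toy step: uniform weights, gradient (ln 2, 0), learning rate 1.
w_new = eg_update([0.5, 0.5], [math.log(2.0), 0.0], 1.0)
print(w_new)
```

In this toy step the first component is halved before normalization, so the unnormalized weights are $(0.25, 0.5)$ and the normalized ones are $(1/3, 2/3)$. For large gradients a practical implementation would typically shift the exponents by their maximum before exponentiating, which leaves the normalized result unchanged.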

The gradient-based forecaster using the Legendre exponential potential with projected 
weights is sometimes called the EG (exponentiated gradient) algorithm. Although we used 
the Legendre version of the exponential potential to be able to define EG as a gradient-based 
forecaster, it turns out that the original potential is better suited to prove a regret bound. For 
this reason, we slightly change the proof of Theorem 11.1 and use the standard exponential 
potential. Finally, because the EG forecaster uses normalized weights, we also restrict our 
expert class to those $u$ that belong to the probability simplex in $\mathbb{R}^d$.


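Theorem 11.3. For any $\alpha$-subquadratic nice pair $(\sigma, \ell)$, if the EG forecaster is run with learning rate $\lambda = 2\varepsilon/(\alpha X_\infty^2)$ on a sequence $(x_1, y_1), (x_2, y_2), \ldots \in \mathbb{R}^d \times \mathbb{R}$, where $0 < \varepsilon < 1$, then for all $u \in \mathbb{R}^d$ in the probability simplex and for all $n \ge 1$ such that $\max_{t=1,\ldots,n}\|x_t\|_\infty \le X_\infty$,

$$\hat L_n^\sigma \le \frac{L_n^\sigma(u)}{1 - \varepsilon} + \frac{\alpha X_\infty^2 \ln d}{2\varepsilon(1 - \varepsilon)}.$$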


Proof. Recall the (non-Legendre) exponential potential, with $\eta = 1$,

$$\Phi(u) = \ln\sum_{i=1}^d e^{u_i}.$$

We now go through the proof of Theorem 11.1 using the relative entropy

$$D(u\|v) = \sum_{i=1}^d u_i \ln\frac{u_i}{v_i}$$

(see Section A.2) in place of the divergence $D_{\Phi^*}(u, v)$. Example 11.2 tells us that the relative entropy is the Bregman divergence for the unnormalized negative entropy potential

$$\Phi^*(u) = \sum_{i=1}^d u_i \ln u_i - \sum_{i=1}^d u_i, \qquad u \in (0, \infty)^d,$$

in the special case when both $u$ and $v$ belong to the probability simplex in $\mathbb{R}^d$.

Let $w'_{i,t} = w_{i,t-1}\,e^{-\lambda\nabla\ell_t^\sigma(w_{t-1})_i}$ and let $w_t$ be $w'_t$ normalized. Just as in Theorem 11.1, we begin the analysis by applying Taylor's theorem to $\ell_t^\sigma$:

$$\ell_t^\sigma(w_{t-1}) - \ell_t^\sigma(u) \le -(u - w_{t-1})\cdot\nabla\ell_t^\sigma(w_{t-1}).$$

Introducing the abbreviation $z = \lambda\nabla\ell_t^\sigma(w_{t-1})$ and the new vector $v$ with components $v_i = w_{t-1}\cdot z - z_i$, we proceed as follows:

$$\begin{aligned}
-(u - w_{t-1})\cdot z
&= -u\cdot z + w_{t-1}\cdot z - \ln\left(\sum_{i=1}^d w_{i,t-1}e^{v_i}\right) + \ln\left(\sum_{i=1}^d w_{i,t-1}e^{v_i}\right) \\
&= -u\cdot z - \ln\left(\sum_{i=1}^d w_{i,t-1}e^{-z_i}\right) + \ln\left(\sum_{i=1}^d w_{i,t-1}e^{v_i}\right) \\
&= \sum_{j=1}^d u_j \ln e^{-z_j} - \ln\left(\sum_{i=1}^d w_{i,t-1}e^{-z_i}\right) + \ln\left(\sum_{i=1}^d w_{i,t-1}e^{v_i}\right) \\
&= \sum_{j=1}^d u_j \ln\left(\frac{1}{w_{j,t-1}}\cdot\frac{w_{j,t-1}e^{-z_j}}{\sum_{i=1}^d w_{i,t-1}e^{-z_i}}\right) + \ln\left(\sum_{i=1}^d w_{i,t-1}e^{v_i}\right) \\
&= D(u\|w_{t-1}) - D(u\|w_t) + \ln\left(\sum_{i=1}^d w_{i,t-1}e^{v_i}\right).
\end{aligned}$$

Note that, as in Theorem 11.1, we have obtained a telescoping sum. However, unlike Theorem 11.1, the third term is not a relative entropy.

On the other hand, bounding this extra term is not difficult. Since $w_{t-1}$ belongs to the simplex, we may view $v_1, \ldots, v_d$ as the range of a zero-mean random variable $V$ distributed according to $w_{t-1}$. To this end, let

$$C_t = \left|\left.\frac{d\ell(\sigma(v), y_t)}{dv}\right|_{v = w_{t-1}\cdot x_t}\right|\,\|x_t\|_\infty$$

so that $z_i \in [-\lambda C_t, \lambda C_t]$. Applying Hoeffding's inequality (Lemma A.1) to $V$ then yields

$$\ln\left(\sum_{i=1}^d w_{i,t-1}e^{v_i}\right) \le \frac{\lambda^2 C_t^2}{2}.$$

Summing over $t = 1, \ldots, n$ gives

$$\sum_{t=1}^n \bigl(\ell_t^\sigma(w_{t-1}) - \ell_t^\sigma(u)\bigr) \le \frac{D(u\|w_0)}{\lambda} + \frac{\lambda}{2}\sum_{t=1}^n C_t^2,$$

where the negative term $-D(u\|w_n)$ has been discarded. Using the assumption that $(\sigma, \ell)$ is $\alpha$-subquadratic, we have $C_t^2 \le \alpha\,\ell_t^\sigma(w_{t-1})\,X_\infty^2$. Moreover, $\theta_0 = 0$ implies that, after normalization, $w_0 = (1/d, \ldots, 1/d)$. This gives $D(u\|w_0) \le \ln d$ (see Examples 11.2 and 11.4). Substituting these values in the above inequality we get

$$R_n^\sigma(u) \le \frac{\ln d}{\lambda} + \frac{\alpha\lambda}{2}\,X_\infty^2\,\hat L_n^\sigma.$$

Note that this inequality has the same form as the corresponding inequality at the end of the proof of Theorem 11.2. Substituting our choice of $\lambda$ and rearranging yields the desired result. □
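The chain of equalities in the proof, together with the Hoeffding step, can be verified numerically for one round. The following sketch is an illustration only; the vectors `w_prev`, `u`, and `z` are arbitrary test values (with `z` standing for $\lambda\nabla\ell_t^\sigma(w_{t-1})$), and the names are assumptions.

```python
import math

def relative_entropy(u, v):
    """D(u || v) = sum_i u_i ln(u_i / v_i), with the convention 0 ln 0 = 0."""
    return sum(ui * math.log(ui / vi) for ui, vi in zip(u, v) if ui > 0)

w_prev = [0.2, 0.5, 0.3]     # w_{t-1}, a point of the simplex
u = [0.6, 0.1, 0.3]          # comparison vector in the simplex
z = [0.4, -0.7, 0.1]         # stands for lambda * gradient

unnormalized = [wi * math.exp(-zi) for wi, zi in zip(w_prev, z)]
total = sum(unnormalized)
w_next = [x / total for x in unnormalized]           # normalized EG update

wz = sum(wi * zi for wi, zi in zip(w_prev, z))
v = [wz - zi for zi in z]                            # zero mean under w_prev
log_mgf = math.log(sum(wi * math.exp(vi) for wi, vi in zip(w_prev, v)))

lhs = -sum((ui - wi) * zi for ui, wi, zi in zip(u, w_prev, z))
rhs = relative_entropy(u, w_prev) - relative_entropy(u, w_next) + log_mgf
hoeffding = 0.5 * max(abs(zi) for zi in z) ** 2      # lambda^2 C_t^2 / 2 with lambda C_t = max |z_i|
print(lhs, rhs, log_mgf, hoeffding)
```

The first two printed numbers should agree, and the extra logarithmic term should lie between 0 and the Hoeffding bound.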


Note that, as for the analysis in Chapter 2, we can obtain a bound equivalent to that of 
Theorem 11.3 by using Theorem 11.2 with the polynomial potential tuned to $p = 2\ln d$.

The exponential potential has the additional limitation of using only positive weights. As explained in Grove, Littlestone, and Schuurmans [133], this limitation can actually be overcome by feeding to the forecaster modified side-information vectors $x'_t \in \mathbb{R}^{2d}$, where

$$x'_t = (-x_{1,t},\, x_{1,t},\, -x_{2,t},\, x_{2,t},\, \ldots,\, -x_{d,t},\, x_{d,t})$$

and $x_t = (x_{1,t}, \ldots, x_{d,t})$ is the original unmodified vector. Note that running the gradient-based linear forecaster with exponential potential on these modified vectors amounts to running the same forecaster with the hyperbolic cosine potential (see Example 11.5) on the original sequence of side information vectors.

We close this section by comparing the results obtained so far for the gradient-based forecaster. Note that the bounds of Theorem 11.2 (for the polynomial potential) and Theorem 11.3 (for the exponential potential) are different just because they use different pairs of dual norms. To allow a fair comparison between these bounds, fix some linear expert $u$ and consider a slight extension of Theorem 11.3 in which the simplex onto which each $w_t$ is projected is scaled enough to contain the chosen $u$. Then the bound for the exponential potential takes the form

$$\frac{L_n^\sigma(u)}{1 - \varepsilon} + \frac{\alpha}{2\varepsilon(1 - \varepsilon)} \times (\ln d)\,\|u\|_1^2\,X_\infty^2,$$

where $X_\infty = \max_t \|x_t\|_\infty$. The bound of Theorem 11.2 for the polynomial potential is very similar:

$$\frac{L_n^\sigma(u)}{1 - \varepsilon} + \frac{\alpha}{2\varepsilon(1 - \varepsilon)} \times \frac{p - 1}{2}\,\|u\|_q^2\,X_p^2,$$

where $X_p = \max_t \|x_t\|_p$ and $(p, q)$ are conjugate exponents. Note that the two bounds differ only because the sizes of $u$ and $x_t$ are measured using different pairs of dual norms (1 is the conjugate exponent of $\infty$). For $p \approx 2\ln d$ the bound for the polynomial potential becomes essentially equivalent to the one for the exponential potential. To analyze the other extreme, $p = 2$ (the spherical potential), note that, using $\|v\|_\infty \le \|v\|_2 \le \|v\|_1$ for all $v \in \mathbb{R}^d$, it is easy to construct sequences such that one of the two potentials gives a regret bound substantially smaller than the other. For instance, consider a sequence $(x_1, y_1), (x_2, y_2), \ldots$ where, for all $t$, $x_t \in \{-1, 1\}^d$ and $y_t = u \cdot x_t$ for $u = (1, 0, \ldots, 0)$. Then $\|u\|_2^2 X_2^2 = d$ and $\|u\|_1^2 X_\infty^2 = 1$. Hence, the exponential potential has a considerable advantage when a sequence of "dense" side information $x_t$ can be well predicted by some "sparse" expert $u$. In the symmetric situation (sparse side information and dense experts) the spherical potential is better. As shown by the arguments of Kivinen and Warmuth [181], these discrepancies turn out to be real properties of the algorithms, and not mere artifacts of the proofs.
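The dense-versus-sparse comparison can be made concrete by evaluating the norm factors of the two bounds. The sketch below is illustrative; the dimension `d = 16`, the particular sign pattern of `x`, and the helper `norm` are arbitrary choices.

```python
import math

def norm(v, p):
    """The p-norm of v; p = math.inf gives the maximum norm."""
    if p == math.inf:
        return max(abs(x) for x in v)
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

d = 16
u = [1.0] + [0.0] * (d - 1)                            # sparse expert (1, 0, ..., 0)
x = [1.0 if i % 2 == 0 else -1.0 for i in range(d)]    # dense side information in {-1, 1}^d

spherical = norm(u, 2) ** 2 * norm(x, 2) ** 2          # ||u||_2^2 X_2^2, equals d
exponential = norm(u, 1) ** 2 * norm(x, math.inf) ** 2 # ||u||_1^2 X_inf^2, equals 1

# Full multiplicative factors from the two bounds, with p = 2 for the spherical potential.
spherical_factor = (2 - 1) / 2 * spherical
exponential_factor = math.log(d) * exponential
print(spherical_factor, exponential_factor)
```

For this dimension the spherical factor is $d/2 = 8$ while the exponential factor is $\ln 16 \approx 2.77$, and the gap widens linearly versus logarithmically as $d$ grows.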


11.5 Forecasters Using Bregman Projections 


In this section we introduce a modified gradient-based linear forecaster, which always 
chooses its weights from a given convex set. This is done by following each gradient-based 
update by a projection onto the convex set, where the projection is based on the Bregman 
divergence defined by the forecaster’s potential (Theorem 11.3 is a first example of this 
technique). Using this simple trick, we are able to extend some of the results proven in the 
previous sections. 

In essence, projection is used to guarantee that the forecaster's weights are kept in a region where they enjoy certain useful properties. For example, in one of the applications of projected forecasters shown later, confining the weights in a convex region $S$ of small diameter allows the forecaster to effectively "track" the best linear forecaster as it moves around in the same region $S$. This is reminiscent of the "weight sharing" technique of Section 5.2, where, in order to track the best expert, the forecaster's weights were kept close to the uniform distribution.

Let $S \subseteq \mathbb{R}^d$ be a convex and closed set and define the forecaster based on the update

$$w_t = P_{\Phi^*}(w'_t, S) \qquad\text{(projected gradient-based linear update)},$$

where $w'_t = \nabla\Phi\bigl(\nabla\Phi^*(w_{t-1}) - \lambda\nabla\ell_t(w_{t-1})\bigr)$ is the standard dual gradient update, used here as an intermediate step, and $P_{\Phi^*}(v, S)$ denotes the Bregman projection $\mathrm{argmin}_{u \in S} D_{\Phi^*}(u, v)$ of $v$ onto $S$ (if $v \in S$, then we set $P_{\Phi^*}(v, S) = v$). We now show two interesting applications of the projected forecaster.
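For the polynomial potential and a set of the form $S = \{u : \Phi_q(u) \le U_q\}$, which is the set used in the applications that follow, the projection can be sketched as a simple rescaling. This rests on an assumption not stated in the text: since $\nabla\Phi_q$ is positively homogeneous of degree one, the optimality conditions of the projection problem are satisfied by a multiple of $v$. The names `phi`, `bregman`, and `project` are hypothetical.

```python
import math

def p_norm(u, p):
    return sum(abs(x) ** p for x in u) ** (1.0 / p)

def grad_half_sq_norm(u, p):
    """Gradient of (1/2)||u||_p^2 (Example 11.3); zero at the origin."""
    norm = p_norm(u, p)
    if norm == 0.0:
        return [0.0] * len(u)
    return [math.copysign(abs(x) ** (p - 1), x) / norm ** (p - 2) for x in u]

def phi(u, q):
    """Legendre polynomial potential Phi_q(u) = (1/2)||u||_q^2."""
    return 0.5 * p_norm(u, q) ** 2

def bregman(u, v, q):
    """D_{Phi_q}(u, v) = Phi_q(u) - Phi_q(v) - (u - v) . grad Phi_q(v)."""
    g = grad_half_sq_norm(v, q)
    return phi(u, q) - phi(v, q) - sum((a - b) * c for a, b, c in zip(u, v, g))

def project(v, q, U_q):
    """Bregman projection onto {u : Phi_q(u) <= U_q}, assumed here to be a rescaling of v."""
    value = phi(v, q)
    if value <= U_q:
        return list(v)                    # already inside S: leave unchanged
    scale = math.sqrt(U_q / value)        # Phi_q is homogeneous of degree two
    return [scale * x for x in v]

q, U_q = 1.5, 1.0
v = [2.0, -1.0, 0.5]
w = project(v, q, U_q)
print(w, phi(w, q), bregman(w, v, q))
```

A quick way to gain confidence in the rescaling assumption is to compare the divergence of the projected point with that of other points of $S$; none should come closer to `v`.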


Tracking Linear Experts 
All results shown so far in this chapter have a common feature: even though the bounds 
hold for wide classes of data sequences, the forecaster’s regret is always measured against 
the best fixed linear expert. In this section we show that the projected gradient-based linear 
forecaster has a good regret against any linear expert that is allowed to change its weight at 
each time step in a controlled fashion. More precisely, the regret of the projected forecaster 
is shown to scale with a measure of the overall amount of changes the expert undergoes. 
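Given a transfer function $\sigma$ and an arbitrary sequence of experts $\langle u_t \rangle = u_0, u_1, \ldots \in \mathbb{R}^d$, define the tracking regret by

$$R_n^\sigma(\langle u_t \rangle) = \hat L_n^\sigma - L_n^\sigma(\langle u_t \rangle) = \sum_{t=1}^n \ell_t^\sigma(w_{t-1}) - \sum_{t=1}^n \ell_t^\sigma(u_{t-1}).$$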


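Theorem 11.4. Fix any $\alpha$-subquadratic nice pair $(\sigma, \ell)$. Let $1/p + 1/q = 1$, $U_q > 0$, and $\varepsilon \in (0, 1)$ be parameters. If the projected gradient-based forecaster based on the Legendre polynomial potential $\Phi_p$ is run with $S = \{w \in \mathbb{R}^d : \Phi_q(w) \le U_q\}$ and learning rate $\lambda = 2\varepsilon/((p - 1)\,\alpha X_p^2)$ on a sequence $(x_1, y_1), (x_2, y_2), \ldots$ in $\mathbb{R}^d \times \mathbb{R}$, then for all sequences $\langle u_t \rangle = u_0, u_1, \ldots \in S$ and for all $n \ge 1$ such that $\max_{t=1,\ldots,n}\|x_t\|_p \le X_p$,

$$\hat L_n^\sigma \le \frac{L_n^\sigma(\langle u_t \rangle)}{1 - \varepsilon} + \frac{(p - 1)\,\alpha X_p^2}{2\varepsilon(1 - \varepsilon)}\left(\sqrt{2U_q}\sum_{t=1}^n \|u_{t-1} - u_t\|_q + \frac{\|u_n\|_q^2}{2}\right).$$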

Observe that the smaller the parameter $U_q$, the smaller the second term becomes in the upper bound. On the other hand, with a small value of $U_q$, the set $S$ shrinks and the loss of the forecaster is compared to sequences $\langle u_t \rangle$ taking values in a reduced set.

Note that when $u_0 = u_1 = \cdots = u_n$, the regret bound reduces to the bound proven in Theorem 11.2 for the regret against a fixed linear expert. The term

$$\sum_{t=1}^n \|u_{t-1} - u_t\|_q$$

can thus be viewed as a measure of "complexity" for the sequence $\langle u_t \rangle$.

Before proving Theorem 11.4 we need a technical lemma stating that the polynomial potential of a vector is invariant with respect to the invertible mapping $\nabla\Phi_p$.

Lemma 11.6. Let $\Phi_p$ be the Legendre polynomial potential. For all $\theta \in \mathbb{R}^d$, $\Phi_p(\theta) = \Phi_q(\nabla\Phi_p(\theta))$.

Proof. Let $w = \nabla\Phi_p(\theta)$. By Lemma 11.4, $\Phi_p(\theta) + \Phi_q(w) = \theta \cdot w$. By Hölder's inequality, $\theta \cdot w \le 2\sqrt{\Phi_p(\theta)\,\Phi_q(w)}$. Hence,

$$\left(\sqrt{\Phi_p(\theta)} - \sqrt{\Phi_q(w)}\right)^2 \le 0,$$

which implies $\Phi_p(\theta) = \Phi_q(w)$. □
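Lemma 11.6, together with the equality from Lemma 11.4 used in its proof, is easy to confirm numerically. The sketch below is an illustration with assumed helper names and an arbitrary test vector `theta`.

```python
import math

def p_norm(u, p):
    return sum(abs(x) ** p for x in u) ** (1.0 / p)

def grad_half_sq_norm(u, p):
    """Gradient of (1/2)||u||_p^2 (Example 11.3); zero at the origin."""
    norm = p_norm(u, p)
    if norm == 0.0:
        return [0.0] * len(u)
    return [math.copysign(abs(x) ** (p - 1), x) / norm ** (p - 2) for x in u]

def phi(u, p):
    """Legendre polynomial potential Phi_p(u) = (1/2)||u||_p^2."""
    return 0.5 * p_norm(u, p) ** 2

p = 3.0
q = p / (p - 1.0)                       # conjugate exponent
theta = [1.5, -0.3, 0.8, -2.0]
w = grad_half_sq_norm(theta, p)         # w = grad Phi_p(theta)

primal_potential = phi(theta, p)        # Phi_p(theta)
dual_potential = phi(w, q)              # Phi_q(w)
inner = sum(a * b for a, b in zip(theta, w))
print(primal_potential, dual_potential, inner)
```

The two potentials should coincide, and their sum should equal the inner product $\theta \cdot w$.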


We are now ready to prove the theorem bounding the tracking regret of the projected 
gradient-based forecaster. 


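Proof of Theorem 11.4. From the proof of Theorem 11.1 we get

$$\begin{aligned}
\ell_t^\sigma(w_{t-1}) - \ell_t^\sigma(u_{t-1})
&\le \frac{1}{\lambda}\bigl(D_{\Phi_q}(u_{t-1}, w_{t-1}) - D_{\Phi_q}(u_{t-1}, w'_t) + D_{\Phi_q}(w_{t-1}, w'_t)\bigr) \\
&\le \frac{1}{\lambda}\bigl(D_{\Phi_q}(u_{t-1}, w_{t-1}) - D_{\Phi_q}(u_{t-1}, w_t) + D_{\Phi_q}(w_{t-1}, w'_t)\bigr),
\end{aligned}$$

where in the second step we used the fact that, since $u_{t-1} \in S$, by the generalized pythagorean inequality (Lemma 11.3),

$$D_{\Phi_q}(u_{t-1}, w'_t) \ge D_{\Phi_q}(u_{t-1}, w_t) + D_{\Phi_q}(w_t, w'_t) \ge D_{\Phi_q}(u_{t-1}, w_t).$$

With the purpose of obtaining a telescoping sum, we add and subtract $D_{\Phi_q}(u_{t-1}, w_t) - D_{\Phi_q}(u_t, w_t)$ in the last formula, obtaining

$$\begin{aligned}
\ell_t^\sigma(w_{t-1}) - \ell_t^\sigma(u_{t-1})
\le \frac{1}{\lambda}\bigl(&D_{\Phi_q}(u_{t-1}, w_{t-1}) - D_{\Phi_q}(u_t, w_t) \\
&- D_{\Phi_q}(u_{t-1}, w_t) + D_{\Phi_q}(u_t, w_t) + D_{\Phi_q}(w_{t-1}, w'_t)\bigr).
\end{aligned}$$

We analyze the five terms on the right-hand side as we sum for $t = 1, \ldots, n$. The first two terms telescope; hence, as in the proof of Theorem 11.2, we get

$$\sum_{t=1}^n \bigl(D_{\Phi_q}(u_{t-1}, w_{t-1}) - D_{\Phi_q}(u_t, w_t)\bigr) \le \Phi_q(u_0).$$

Using the definition of Bregman divergence, the third and fourth terms can be rewritten as

$$-D_{\Phi_q}(u_{t-1}, w_t) + D_{\Phi_q}(u_t, w_t) = \Phi_q(u_t) - \Phi_q(u_{t-1}) + (u_{t-1} - u_t)\cdot\nabla\Phi_q(w_t).$$

Note that the sum of $\Phi_q(u_t) - \Phi_q(u_{t-1})$ telescopes. Moreover, letting $\theta_t = \nabla\Phi_q(w_t)$,

$$\begin{aligned}
(u_{t-1} - u_t)\cdot\nabla\Phi_q(w_t) &= (u_{t-1} - u_t)\cdot\theta_t \\
&\le 2\sqrt{\Phi_q(u_{t-1} - u_t)\,\Phi_p(\theta_t)} \qquad\text{(by Hölder's inequality)} \\
&= 2\sqrt{\Phi_q(u_{t-1} - u_t)\,\Phi_q(w_t)} \qquad\text{(by Lemma 11.6)} \\
&\le \|u_{t-1} - u_t\|_q\,\sqrt{2U_q} \qquad\text{(since } w_t \in S\text{)}.
\end{aligned}$$

Hence,

$$\sum_{t=1}^n \bigl(-D_{\Phi_q}(u_{t-1}, w_t) + D_{\Phi_q}(u_t, w_t)\bigr) \le \Phi_q(u_n) - \Phi_q(u_0) + \sqrt{2U_q}\sum_{t=1}^n \|u_{t-1} - u_t\|_q.$$

Finally, we bound the fifth term using the techniques described in the proof of Theorem 11.2:

$$\frac{1}{\lambda}\sum_{t=1}^n D_{\Phi_q}(w_{t-1}, w'_t) \le (p - 1)\frac{\alpha\lambda}{2}\,X_p^2\,\hat L_n^\sigma.$$

Hence, piecing everything together,

$$\begin{aligned}
\sum_{t=1}^n \bigl(\ell_t^\sigma(w_{t-1}) - \ell_t^\sigma(u_{t-1})\bigr)
&\le \frac{\Phi_q(u_0)}{\lambda} + \frac{\Phi_q(u_n) - \Phi_q(u_0)}{\lambda} + \frac{\sqrt{2U_q}}{\lambda}\sum_{t=1}^n \|u_{t-1} - u_t\|_q + (p - 1)\frac{\alpha\lambda}{2}\,X_p^2\,\hat L_n^\sigma \\
&= \frac{\Phi_q(u_n)}{\lambda} + \frac{\sqrt{2U_q}}{\lambda}\sum_{t=1}^n \|u_{t-1} - u_t\|_q + (p - 1)\frac{\alpha\lambda}{2}\,X_p^2\,\hat L_n^\sigma.
\end{aligned}$$

Rearranging, substituting our choice of $\lambda$, and using the equality $\Phi_q(u_n) = \frac12\|u_n\|_q^2$ yields the desired result. □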


Self-Confident Linear Forecasters 

The results of Section 11.4 leave open the problem of finding a forecaster with a regret growing sublinearly in $L_n^\sigma(u)$. If one limited the possible values of $u$ to a bounded subset, then by choosing the learning rate $\lambda$ as a function of $n$ (assuming that the total number of rounds $n$ is known in advance), one could optimize the bound for the maximal regret and obtain a bound that grows at a rate of $\sqrt{n}$. However, ideally, the regret bound should scale as $\sqrt{L_n^\sigma(u)}$ for each $u$. Next, we show that, using a time-varying learning rate, the regret of the projected forecaster is bounded by a quantity of the order of $\sqrt{L_n^\sigma(u)}$ uniformly over time and for all $u$ in a region of bounded potential.

SELF-CONFIDENT GRADIENT-BASED FORECASTER

Parameters: $\alpha$-subquadratic nice pair $(\sigma, \ell)$, reals $p \ge 2$, $U_q > 0$, convex set $S = \{u \in \mathbb{R}^d : \Phi_q(u) \le U_q\}$, where $1/p + 1/q = 1$.

Initialization: $w_0 = \nabla\Phi_p(0)$.

For each round $t = 1, 2, \ldots$

(1) observe $x_t$ and predict $\hat p_t = \sigma(w_{t-1} \cdot x_t)$;
(2) get $y_t \in \mathbb{R}$ and incur loss $\ell_t^\sigma(w_{t-1}) = \ell(\hat p_t, y_t)$;
(3) let $w'_t = \nabla\Phi_p\bigl(\nabla\Phi_q(w_{t-1}) - \lambda_t\nabla\ell_t^\sigma(w_{t-1})\bigr)$, where

$$\lambda_t = \frac{\beta_t}{(p - 1)\,\alpha X_{p,t}^2}, \qquad X_{p,t} = \max_{s=1,\ldots,t}\|x_s\|_p, \qquad \beta_t = \sqrt{\frac{k_t}{k_t + \hat L_t^\sigma}},$$

$$k_t = (p - 1)\,\alpha X_{p,t}^2\,U_q, \qquad \hat L_t^\sigma = \sum_{s=1}^t \ell_s^\sigma(w_{s-1});$$

(4) let $w_t = P_{\Phi_q}(w'_t, S)$.
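The four steps of the self-confident forecaster can be sketched as below. The sketch makes several assumptions that are not in the text: it fixes the nice pair of Example 11.6 (hyperbolic tangent with entropic loss, $\alpha = 2$, derivative $\sigma(v) - y$), it implements the projection of step (4) as a rescaling into $S$ (justified by the homogeneity of $\nabla\Phi_q$), and it uses hypothetical names and synthetic data. The final lines evaluate the regret bound stated in Theorem 11.5 for one reference expert.

```python
import math
import random

def p_norm(u, p):
    return sum(abs(x) ** p for x in u) ** (1.0 / p)

def grad_half_sq_norm(u, p):
    """Gradient of (1/2)||u||_p^2 (Example 11.3); zero at the origin."""
    norm = p_norm(u, p)
    if norm == 0.0:
        return [0.0] * len(u)
    return [math.copysign(abs(x) ** (p - 1), x) / norm ** (p - 2) for x in u]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def entropic_loss(p_hat, y):
    """Entropic loss of Example 11.6, for p_hat and y strictly inside (-1, 1)."""
    return (0.5 * (1 + y) * math.log((1 + y) / (1 + p_hat))
            + 0.5 * (1 - y) * math.log((1 - y) / (1 - p_hat)))

def self_confident_forecaster(data, p, U_q, alpha):
    """Steps (1)-(4) with tanh transfer and entropic loss; returns the cumulative loss."""
    q = p / (p - 1.0)
    w = [0.0] * len(data[0][0])                              # w_0 = 0
    X, cum_loss = 0.0, 0.0
    for x, y in data:
        p_hat = math.tanh(dot(w, x))                         # step (1)
        cum_loss += entropic_loss(p_hat, y)                  # step (2)
        X = max(X, p_norm(x, p))                             # X_{p,t}
        if X == 0.0:
            continue                                         # all gradients are zero so far
        k = (p - 1) * alpha * X ** 2 * U_q                   # k_t
        beta = math.sqrt(k / (k + cum_loss))                 # beta_t, uses the forecaster's own loss
        lam = beta / ((p - 1) * alpha * X ** 2)              # lambda_t
        theta = grad_half_sq_norm(w, q)                      # step (3)
        theta = [th - lam * (p_hat - y) * xi for th, xi in zip(theta, x)]
        w = grad_half_sq_norm(theta, p)
        potential = 0.5 * p_norm(w, q) ** 2                  # step (4): rescale into S if needed
        if potential > U_q:
            scale = math.sqrt(U_q / potential)
            w = [scale * wi for wi in w]
    return cum_loss

random.seed(0)
d, n, p, U_q, alpha = 5, 300, 3.0, 1.0, 2.0
u = [0.5, -0.5, 0.0, 0.0, 0.0]                               # hypothetical expert with Phi_q(u) <= U_q
xs = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
ys = [math.tanh(dot(u, x)) + random.uniform(-0.2, 0.2) for x in xs]
data = list(zip(xs, ys))

forecaster_loss = self_confident_forecaster(data, p, U_q, alpha)
expert_loss = sum(entropic_loss(math.tanh(dot(u, x)), y) for x, y in data)
k_n = (p - 1) * alpha * max(p_norm(x, p) for x in xs) ** 2 * U_q
regret_bound = 5 * math.sqrt(k_n * expert_loss) + 30 * k_n
print(forecaster_loss - expert_loss, regret_bound)
```

Note that the learning rate at round $t$ depends only on quantities observed up to that round, so no knowledge of the horizon $n$ or of the best expert's loss is required.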


A similar result is achieved in Section 2.3 of Chapter 2 by tuning the exponentially weighted average forecaster at time $t$ with the loss of the best expert up to time $t$. Here, however, there are infinitely many experts, and the problem of tracking the cumulative loss of the best expert could easily become impractical. Hence, we use the trick of tuning the forecaster at time $t + 1$ using its own loss $\hat L_t^\sigma$. Since replacing the best expert's loss with the forecaster's loss is justified only if the forecaster is doing almost as well as the currently best expert in predicting the sequence, we call this the "self-confident" gradient-based forecaster. The next result shows that the forecaster's self-confidence is indeed justified.


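Theorem 11.5. Fix any $\alpha$-subquadratic nice pair $(\sigma, \ell)$. If the self-confident gradient-based forecaster is run on a sequence $(x_1, y_1), (x_2, y_2), \ldots \in \mathbb{R}^d \times \mathbb{R}$, then for all $u \in \mathbb{R}^d$ such that $\Phi_q(u) \le U_q$ and for all $n \ge 1$,

$$R_n^\sigma(u) \le 5\sqrt{(p - 1)\,\alpha X_p^2\,U_q\,L_n^\sigma(u)} + 30\,(p - 1)\,\alpha X_p^2\,U_q,$$

where $X_p = \max_{t=1,\ldots,n}\|x_t\|_p$.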


Note that the self-confident forecaster assumes a bound $\sqrt{2U_q}$ on the norm $\|u\|_q$ of the linear experts $u$ against which the regret is measured. The gradient-based forecaster of Section 11.4, instead, assumes a bound $X_p$ on the largest norm $\|x_t\|_p$ of the side-information sequence $x_1, \ldots, x_n$. Since $u \cdot x_t \le \|u\|_q\,\|x_t\|_p$, the two assumptions impose similar constraints on the experts.

Before proceeding to the proof of Theorem 11.5, we give two lemmas. The first states that the divergence of vectors with bounded polynomial potential is bounded. This simple fact is crucial in the proof of the theorem. Note that the same lemma is not true for the exponential potential, and this is the main reason why we analyze the self-confident predictor for the polynomial potential only.

Lemma 11.7. For all $u, v \in \mathbb{R}^d$ such that $\Phi_q(u) \le U_q$ and $\Phi_q(v) \le U_q$, $D_{\Phi_q}(u, v) \le 4U_q$.

Proof. First of all, note that the Bregman divergence based on the polynomial potential can be rewritten as

$$D_{\Phi_q}(u, v) = \Phi_q(u) + \Phi_q(v) - u\cdot\nabla\Phi_q(v).$$

This implies the following:

$$\begin{aligned}
D_{\Phi_q}(u, v) &\le \Phi_q(u) + \Phi_q(v) + \bigl|u\cdot\nabla\Phi_q(v)\bigr| \\
&\le \Phi_q(u) + \Phi_q(v) + 2\sqrt{\Phi_q(u)\,\Phi_p\bigl(\nabla\Phi_q(v)\bigr)} \qquad\text{(by Hölder's inequality)} \\
&= \Phi_q(u) + \Phi_q(v) + 2\sqrt{\Phi_q(u)\,\Phi_q(v)} \qquad\text{(by Lemma 11.6)} \\
&\le 4U_q
\end{aligned}$$

and the proof is concluded. □

The proof of the next lemma is left as exercise.


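Lemma 11.8. Let $a, \ell_1, \ldots, \ell_n$ be nonnegative real numbers. Then

$$\sum_{t=1}^n \frac{\ell_t}{\sqrt{a + \sum_{s=1}^t \ell_s}} \le 2\left(\sqrt{a + \sum_{t=1}^n \ell_t} - \sqrt{a}\right).$$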
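The inequality of Lemma 11.8 can be probed numerically before attempting the exercise. The sketch below is illustrative; the function name `lemma_sides` and the sample inputs are assumptions, and terms with a zero denominator (which can only occur when the numerator is also zero) are skipped.

```python
import math

def lemma_sides(a, losses):
    """Return (left-hand side, right-hand side) of Lemma 11.8 for nonnegative inputs."""
    total = a
    lhs = 0.0
    for loss in losses:
        total += loss                      # a + l_1 + ... + l_t
        if total > 0.0:
            lhs += loss / math.sqrt(total)
    rhs = 2.0 * (math.sqrt(total) - math.sqrt(a))
    return lhs, rhs

lhs, rhs = lemma_sides(1.0, [0.5, 2.0, 0.0, 1.5, 3.0])
print(lhs, rhs)
```

Each term satisfies $\ell_t/\sqrt{A + \ell_t} \le 2(\sqrt{A + \ell_t} - \sqrt{A})$ with $A$ the running total before round $t$, which is one way to approach the exercise.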


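Proof of Theorem 11.5. Proceeding as in the proof of Theorem 11.4, we get

\[
\ell_t(\mathbf{w}_{t-1}) - \ell_t(\mathbf{u})
\le \frac{1}{\lambda_t}\Bigl( D_{\Phi_q}(\mathbf{u}, \mathbf{w}_{t-1}) - D_{\Phi_q}(\mathbf{u}, \mathbf{w}_t) + D_{\Phi_q}(\mathbf{w}_{t-1}, \mathbf{w}'_t) \Bigr).
\]

Now, by our choice of $\lambda_t$,

\[
\begin{aligned}
&\frac{1}{\lambda_t}\Bigl( D_{\Phi_q}(\mathbf{u}, \mathbf{w}_{t-1}) - D_{\Phi_q}(\mathbf{u}, \mathbf{w}_t) + D_{\Phi_q}(\mathbf{w}_{t-1}, \mathbf{w}'_t) \Bigr) \\
&\quad = (p-1)\frac{X_{p,t}^2}{\beta_t} D_{\Phi_q}(\mathbf{u}, \mathbf{w}_{t-1}) - (p-1)\frac{X_{p,t}^2}{\beta_t} D_{\Phi_q}(\mathbf{u}, \mathbf{w}_t) + \frac{1}{\lambda_t} D_{\Phi_q}(\mathbf{w}_{t-1}, \mathbf{w}'_t) \\
&\quad = (p-1)\left( \frac{X_{p,t}^2}{\beta_t} D_{\Phi_q}(\mathbf{u}, \mathbf{w}_{t-1}) - \frac{X_{p,t+1}^2}{\beta_{t+1}} D_{\Phi_q}(\mathbf{u}, \mathbf{w}_t) \right) \\
&\qquad + (p-1)\, D_{\Phi_q}(\mathbf{u}, \mathbf{w}_t) \left( \frac{X_{p,t+1}^2}{\beta_{t+1}} - \frac{X_{p,t}^2}{\beta_t} \right) + \frac{1}{\lambda_t} D_{\Phi_q}(\mathbf{w}_{t-1}, \mathbf{w}'_t) \\
&\quad \le (p-1)\left( \frac{X_{p,t}^2}{\beta_t} D_{\Phi_q}(\mathbf{u}, \mathbf{w}_{t-1}) - \frac{X_{p,t+1}^2}{\beta_{t+1}} D_{\Phi_q}(\mathbf{u}, \mathbf{w}_t) \right) \\
&\qquad + 4(p-1)\, U_q \left( \frac{X_{p,t+1}^2}{\beta_{t+1}} - \frac{X_{p,t}^2}{\beta_t} \right) + \frac{1}{\lambda_t} D_{\Phi_q}(\mathbf{w}_{t-1}, \mathbf{w}'_t),
\end{aligned}
\]

where we used Lemma 11.7 in the last step. To bound the last term, we again apply
techniques from the proof of Theorem 11.2:

\[
\frac{1}{\lambda_t} D_{\Phi_q}(\mathbf{w}_{t-1}, \mathbf{w}'_t) \le (p-1)\frac{\lambda_t}{2} X_{p,t}^2\, \ell_t(\mathbf{w}_{t-1}) \le \frac{\beta_t}{2}\, \ell_t(\mathbf{w}_{t-1}).
\]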
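We now sum over $t = 1, \ldots, n$. Note that the quantity $X_{p,n+1}^2/\beta_{n+1}$ is a free parameter
here, and we conveniently set $\beta_{n+1} = \beta_n$ and $X_{p,n+1} = X_{p,n}$. This yields

\[
\begin{aligned}
\sum_{t=1}^{n} \bigl( \ell_t(\mathbf{w}_{t-1}) - \ell_t(\mathbf{u}) \bigr)
&\le (p-1)\frac{X_{p,1}^2}{\beta_1} D_{\Phi_q}(\mathbf{u}, \mathbf{w}_0) + 4(p-1)\, U_q \left( \frac{X_{p,n+1}^2}{\beta_{n+1}} - \frac{X_{p,1}^2}{\beta_1} \right) + \frac{1}{2}\sum_{t=1}^{n} \beta_t\, \ell_t(\mathbf{w}_{t-1}) \\
&\le 4(p-1)\, U_q\, \frac{X_{p,n}^2}{\beta_n} + \frac{1}{2}\sum_{t=1}^{n} \beta_t\, \ell_t(\mathbf{w}_{t-1}),
\end{aligned}
\]

where we again applied Lemma 11.7 to the term $D_{\Phi_q}(\mathbf{u}, \mathbf{w}_0)$. Recalling that
$k_n = (p-1) X_{p,n}^2 U_q$, we can rewrite the above as

\[
\begin{aligned}
\sum_{t=1}^{n} \bigl( \ell_t(\mathbf{w}_{t-1}) - \ell_t(\mathbf{u}) \bigr)
&\le 4\sqrt{k_n (k_n + \hat{L}_n)} + \frac{1}{2}\sum_{t=1}^{n} \beta_t\, \ell_t(\mathbf{w}_{t-1}) \\
&\le 4\sqrt{k_n (k_n + \hat{L}_n)} + \frac{\sqrt{k_n}}{2}\sum_{t=1}^{n} \frac{\ell_t(\mathbf{w}_{t-1})}{\sqrt{k_n + \hat{L}_t}},
\end{aligned}
\]

where we used $\beta_t = \sqrt{k_t/(k_t + \hat{L}_t)} \le \sqrt{k_n/(k_n + \hat{L}_t)}$. Applying Lemma 11.8, we then immediately get

\[
\hat{L}_n - L_n(\mathbf{u}) \le 4\sqrt{k_n (k_n + \hat{L}_n)} + \sqrt{k_n (k_n + \hat{L}_n)} = 5\sqrt{k_n (k_n + \hat{L}_n)}.
\]

Solving for $\hat{L}_n$ and overapproximating gives the desired result. □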


Projected gradient-based forecasters can be used with the exponential potential. However,
the convex set $S$ onto which weights are projected should not be taken as the probability
simplex in $\mathbb{R}^d$, which is the most natural choice for this potential (see, e.g., Theorem 11.3).
The region of the simplex where one or more of the weight components are close to 0 would
blow up either the dual of the exponential potential (preventing the tracking analysis of
Theorem 11.4) or the Bregman divergence (preventing the self-confident analysis of
Theorem 11.5). A simple trick to fix this problem is to intersect the simplex with the hypercube
$[\delta/d, 1]^d$, where $0 < \delta < 1$ is a free parameter. This amounts to imposing a lower bound
on each weight component (see Exercise 11.10).
11.6 Time-Varying Potentials


Consider again the gradient-based forecasters introduced in Section 11.3 (without the use 
of any transfer function). To motivate such forecasters, we observed that the dual weight 
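$\mathbf{w}_t$ is a solution to the convex minimization problem

\[
\min_{\mathbf{u} \in \mathbb{R}^d} \Bigl[ D_{\Phi^*}(\mathbf{u}, \mathbf{w}_{t-1}) + \lambda \bigl( \ell_t(\mathbf{w}_{t-1}) + (\mathbf{u} - \mathbf{w}_{t-1}) \cdot \nabla\ell_t(\mathbf{w}_{t-1}) \bigr) \Bigr],
\]

where the term $\ell_t(\mathbf{w}_{t-1}) + (\mathbf{u} - \mathbf{w}_{t-1}) \cdot \nabla\ell_t(\mathbf{w}_{t-1})$ is the first-order Taylor approximation
of $\ell_t(\mathbf{u})$ around $\mathbf{w}_{t-1}$. As mentioned in Exercise 11.3, an alternative (and more natural)
definition of $\mathbf{w}_t$ would look like

\[
\mathbf{w}_t = \operatorname*{argmin}_{\mathbf{u} \in \mathbb{R}^d} \bigl[ D_{\Phi^*}(\mathbf{u}, \mathbf{w}_{t-1}) + \lambda\, \ell_t(\mathbf{u}) \bigr].
\]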


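An interesting closed-form solution for $\mathbf{w}_t$ in this expression is obtained for a potential
that evolves with time, where the evolution depends on the loss. In particular, let $\Phi$ be an
arbitrary Legendre potential and $\Phi^*$ its dual potential. Define the recurrence

\[
\Phi_0^* = \Phi^*, \qquad \Phi_t^* = \Phi_{t-1}^* + \ell_t \qquad \text{(time-varying potential).}
\]

Here, as usual, $\ell_t(\mathbf{w}) = \ell(\mathbf{w} \cdot \mathbf{x}_t, y_t)$ is the convex function induced by the loss at time $t$ and,
conventionally, we let $\ell_0$ be the zero function. If the potentials in the sequence $\Phi_0^*, \Phi_1^*, \ldots$
are all Legendre, then for all $t \ge 1$, the associated forecaster is defined by

\[
\mathbf{w}_t = \operatorname*{argmin}_{\mathbf{u} \in \mathbb{R}^d} \bigl[ D_{\Phi_{t-1}^*}(\mathbf{u}, \mathbf{w}_{t-1}) + \ell_t(\mathbf{u}) \bigr],
\]

where $\mathbf{w}_0 = \mathbf{0}$. (Note that, for simplicity, here we take $\lambda = 1$ because the learning parameter
is not exploited below.) This can also be rewritten as

\[
\mathbf{w}_t = \operatorname*{argmin}_{\mathbf{u} \in \mathbb{R}^d} \bigl[ D_{\Phi_{t-1}^*}(\mathbf{u}, \mathbf{w}_{t-1}) + \Phi_t^*(\mathbf{u}) - \Phi_{t-1}^*(\mathbf{u}) \bigr].
\]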

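By setting to 0 the gradient of the expression in brackets, one finds that the solution
$\mathbf{w}_t$ is defined by $\nabla\Phi_t^*(\mathbf{w}_t) = \nabla\Phi_{t-1}^*(\mathbf{w}_{t-1})$. Solving for $\mathbf{w}_t$ one then gets
$\mathbf{w}_t = \nabla\Phi_t(\nabla\Phi_{t-1}^*(\mathbf{w}_{t-1}))$. Note that, due to $\nabla\Phi_t^*(\mathbf{w}_t) = \nabla\Phi_{t-1}^*(\mathbf{w}_{t-1})$, the above solution can
also be written as $\mathbf{w}_t = \nabla\Phi_t(\boldsymbol{\theta}_0)$, where $\boldsymbol{\theta}_0 = \nabla\Phi^*(\mathbf{0})$ is a base primal weight. This shows
that, in contrast to the fixed potential case, no explicit update is carried out, and the potential
evolution is entirely responsible for the weight dynamics. The two diagrams
below illustrate the evolution of primal and dual weights in the case of fixed (left-hand side)
and time-varying (right-hand side) potential.

[Diagram, left (fixed potential): the primal weights $\boldsymbol{\theta}_0 \to \boldsymbol{\theta}_1 \to \cdots \to \boldsymbol{\theta}_t$ are updated explicitly, and each $\boldsymbol{\theta}_s$ is mapped by $\nabla\Phi$ to the dual weight $\mathbf{w}_s$. Right (time-varying potential): the single base weight $\boldsymbol{\theta}_0$ is mapped by $\nabla\Phi_0, \nabla\Phi_1, \ldots, \nabla\Phi_t$ to $\mathbf{w}_0, \mathbf{w}_1, \ldots, \mathbf{w}_t$.]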
Remark 11.1. If $\nabla\Phi^*(\mathbf{0}) = \mathbf{0}$, then $\nabla\Phi_t^*(\mathbf{w}_t) = \mathbf{0}$ for all $t \ge 0$ and one can equivalently
define $\mathbf{w}_t$ by

\[
\mathbf{w}_t = \operatorname*{argmin}_{\mathbf{u} \in \mathbb{R}^d} \Phi_t^*(\mathbf{u}),
\]

where, we recall, $\Phi_t^*$ is convex for all $t$ (see also Exercise 11.13 for additional remarks on
this alternative formulation).


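Remark 11.2. Note that expanding the Bregman divergence term in the above definition of
the gradient-based linear forecaster, and performing an obvious simplification, yields

\[
\mathbf{w}_t = \operatorname*{argmin}_{\mathbf{u} \in \mathbb{R}^d} \Bigl[ \Phi_t^*(\mathbf{u}) - \bigl( \Phi_{t-1}^*(\mathbf{w}_{t-1}) + (\mathbf{u} - \mathbf{w}_{t-1}) \cdot \nabla\Phi_{t-1}^*(\mathbf{w}_{t-1}) \bigr) \Bigr].
\]

The term in brackets is the difference between $\Phi_t^*(\mathbf{u})$ and the linear approximation of $\Phi_{t-1}^*$
around $\mathbf{w}_{t-1}$, which again looks like a divergence.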


The gradient-based forecaster using the time-varying potential defined above is sketched 
in the following. 


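GRADIENT-BASED FORECASTER WITH TIME-VARYING POTENTIAL
Initialization: $\mathbf{w}_0 = \mathbf{0}$ and $\Phi_0^* = \Phi^*$.
For each round $t = 1, 2, \ldots$

(1) observe $\mathbf{x}_t$ and predict $\hat{p}_t = \mathbf{w}_{t-1} \cdot \mathbf{x}_t$;
(2) get $y_t \in \mathbb{R}$ and incur loss $\ell_t(\mathbf{w}_{t-1}) = \ell(\hat{p}_t, y_t)$;
(3) let $\mathbf{w}_t = \nabla\Phi_t(\nabla\Phi_{t-1}^*(\mathbf{w}_{t-1}))$.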


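Theorem 11.6. Fix a regular loss function $\ell$. If the gradient-based linear forecaster is run
with a time-varying Legendre potential $\Phi$ such that $\nabla\Phi^*(\mathbf{0}) = \mathbf{0}$, then, for all $\mathbf{u} \in \mathbb{R}^d$,

\[
R_n(\mathbf{u}) = D_{\Phi_0^*}(\mathbf{u}, \mathbf{w}_0) - D_{\Phi_n^*}(\mathbf{u}, \mathbf{w}_n) + \sum_{t=1}^{n} D_{\Phi_t^*}(\mathbf{w}_{t-1}, \mathbf{w}_t).
\]

Proof. Choose any $\mathbf{u} \in \mathbb{R}^d$. Using $\nabla\Phi_t^*(\mathbf{w}_t) = \mathbf{0}$ for all $t \ge 0$ (see Remark 11.1), one
immediately gets $D_{\Phi_t^*}(\mathbf{u}, \mathbf{w}_t) = \Phi_t^*(\mathbf{u}) - \Phi_t^*(\mathbf{w}_t)$ for all $\mathbf{u} \in \mathbb{R}^d$. Since $\Phi_t^*(\mathbf{u}) = \Phi_{t-1}^*(\mathbf{u}) + \ell_t(\mathbf{u})$,
we get $\ell_t(\mathbf{u}) = D_{\Phi_t^*}(\mathbf{u}, \mathbf{w}_t) + \Phi_t^*(\mathbf{w}_t) - \Phi_{t-1}^*(\mathbf{u})$ and
$\ell_t(\mathbf{w}_{t-1}) = D_{\Phi_t^*}(\mathbf{w}_{t-1}, \mathbf{w}_t) + \Phi_t^*(\mathbf{w}_t) - \Phi_{t-1}^*(\mathbf{w}_{t-1})$. This yields

\[
\begin{aligned}
\ell_t(\mathbf{w}_{t-1}) - \ell_t(\mathbf{u})
&= D_{\Phi_t^*}(\mathbf{w}_{t-1}, \mathbf{w}_t) - \Phi_{t-1}^*(\mathbf{w}_{t-1}) - D_{\Phi_t^*}(\mathbf{u}, \mathbf{w}_t) + \Phi_{t-1}^*(\mathbf{u}) \\
&= D_{\Phi_t^*}(\mathbf{w}_{t-1}, \mathbf{w}_t) - D_{\Phi_t^*}(\mathbf{u}, \mathbf{w}_t) + D_{\Phi_{t-1}^*}(\mathbf{u}, \mathbf{w}_{t-1}).
\end{aligned}
\]

Summing over $t$ gives the desired result. □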
Note that Theorem 11.6 provides an exact characterization (there are no inequalities!) of
the regret in terms of Bregman divergences. In Section 11.7 we introduce the elliptic potential,
a concrete example of a time-varying potential. Using Theorem 11.6 we derive a bound on
the cumulative regret of the gradient-based forecaster using the elliptic potential.


11.7 The Elliptic Potential 


We now show an application of the time-varying potential based on the polynomial potential 
(with p = 2) and the square loss. To this end, we introduce some additional notation. 

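A vector $\mathbf{u}$ is always understood as a column vector. Let $(\cdot)^\top$ be the transpose operator.
Then $\mathbf{u}^\top$ is a row vector, $A^\top$ is the transpose of matrix $A$, and $\mathbf{u}^\top\mathbf{v}$ is the inner product
between vectors $\mathbf{u}$ and $\mathbf{v}$ (which we also denote with $\mathbf{u} \cdot \mathbf{v}$). Likewise, we define the outer
product $\mathbf{u}\mathbf{v}^\top$ yielding the square matrix whose element in row $i$ and column $j$ is $u_i v_j$.

Let the $d \times d$ matrix $M$ be symmetric, positive definite and of rank $d$. Let $\mathbf{v} \in \mathbb{R}^d$ and
$c \in \mathbb{R}$ be arbitrary. The triple $(M, \mathbf{v}, c)$ defines the potential

\[
\Phi(\mathbf{u}) = \frac{1}{2}\mathbf{u}^\top M \mathbf{u} + \mathbf{u}^\top \mathbf{v} + c \qquad \text{(the elliptic potential).}
\]

Note that because $M$ is positive definite, $M^{1/2}$ exists, and therefore we can also write
$\Phi(\mathbf{u}) = \frac{1}{2}\|M^{1/2}\mathbf{u}\|^2 + \mathbf{u}^\top\mathbf{v} + c$. The elliptic potential is easily seen to be Legendre with
$\nabla\Phi(\mathbf{u}) = M\mathbf{u} + \mathbf{v}$. Since $M$ is full rank, $M^{-1}$ exists and $\nabla\Phi^*(\mathbf{u}) = M^{-1}(\mathbf{u} - \mathbf{v})$. Hence,

\[
\Phi^*(\mathbf{u}) = \frac{1}{2}\bigl\|M^{-1/2}(\mathbf{u} - \mathbf{v})\bigr\|^2 = \frac{1}{2}\mathbf{u}^\top M^{-1}\mathbf{u} - \mathbf{u}^\top M^{-1}\mathbf{v} + \frac{1}{2}\mathbf{v}^\top M^{-1}\mathbf{v},
\]

which shows that the dual potential of an elliptic potential is also elliptic.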
Elliptic potentials enjoy the following property. 


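Lemma 11.9. Let $\Phi$ be an elliptic potential defined by the triple $(M, \mathbf{v}, c)$. Then, for all
$\mathbf{u}, \mathbf{w} \in \mathbb{R}^d$,

\[
D_\Phi(\mathbf{u}, \mathbf{w}) = \frac{1}{2}\bigl\|M^{1/2}(\mathbf{u} - \mathbf{w})\bigr\|^2.
\]

The proof is left as an exercise.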

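We now show that the time-varying potential obtained from the polynomial potential
$\Phi(\mathbf{u}) = \frac{1}{2}\|\mathbf{u}\|^2$ and the square loss $\ell(\hat{p}, y) = \frac{1}{2}(\hat{p} - y)^2$ is an elliptic potential. First note
that $\Phi = \Phi^*$, due to the self-duality of the 2-norm. Rewrite $\Phi^*(\mathbf{u})$ as $\frac{1}{2}\mathbf{u}^\top I \mathbf{u}$, where $I$ is
the $d \times d$ identity matrix, and let $\Phi_0^* = \Phi^*$. Then, following the definition of time-varying
potential,

\[
\begin{aligned}
\Phi_1^*(\mathbf{u}) &= \Phi_0^*(\mathbf{u}) + \ell_1(\mathbf{u}) \\
&= \frac{1}{2}\mathbf{u}^\top I \mathbf{u} + \frac{1}{2}\bigl(\mathbf{u}^\top\mathbf{x}_1 - y_1\bigr)^2 \\
&= \frac{1}{2}\mathbf{u}^\top I \mathbf{u} + \frac{1}{2}\mathbf{u}^\top\mathbf{x}_1\mathbf{x}_1^\top\mathbf{u} - y_1\mathbf{u}^\top\mathbf{x}_1 + \frac{y_1^2}{2} \\
&= \frac{1}{2}\mathbf{u}^\top\bigl(I + \mathbf{x}_1\mathbf{x}_1^\top\bigr)\mathbf{u} - y_1\mathbf{u}^\top\mathbf{x}_1 + \frac{y_1^2}{2}.
\end{aligned}
\]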
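Iterating this argument, we obtain the time-varying elliptic potential

\[
\Phi_t^*(\mathbf{u}) = \frac{1}{2}\mathbf{u}^\top\Bigl(I + \sum_{s=1}^{t}\mathbf{x}_s\mathbf{x}_s^\top\Bigr)\mathbf{u} - \mathbf{u}^\top\sum_{s=1}^{t} y_s\mathbf{x}_s + \frac{1}{2}\sum_{s=1}^{t} y_s^2.
\]

We now take a look at the update rule $\mathbf{w}_t = \nabla\Phi_t(\nabla\Phi_{t-1}^*(\mathbf{w}_{t-1}))$ when $\Phi_t$ is the
time-varying elliptic potential. Introduce

\[
A_t = I + \sum_{s=1}^{t}\mathbf{x}_s\mathbf{x}_s^\top \qquad \text{for all } t = 0, 1, 2, \ldots
\]

Using the definition of $\Phi_t^*$, we can easily compute the gradient

\[
\nabla\Phi_t^*(\mathbf{u}) = A_t\mathbf{u} - \sum_{s=1}^{t} y_s\mathbf{x}_s.
\]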

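Before proceeding with the argument, we need to check that $\Phi_t^*$ is Legendre. Since $\Phi_t^*$
is defined on $\mathbb{R}^d$, and $\nabla\Phi_t^*$ computed above is continuous, we only need to verify that $\Phi_t^*$
is strictly convex. To see this, note that $A_t$ is the Hessian matrix of $\Phi_t^*$. Furthermore, $A_t$ is
positive definite for all $t = 0, 1, \ldots$ because

\[
\mathbf{v}^\top A_t\mathbf{v} = \|\mathbf{v}\|^2 + \sum_{s=1}^{t}\bigl(\mathbf{x}_s^\top\mathbf{v}\bigr)^2 \ge 0
\]

for any $\mathbf{v} \in \mathbb{R}^d$, and $\mathbf{v}^\top A_t\mathbf{v} = 0$ if and only if $\mathbf{v} = \mathbf{0}$. This implies that $\Phi_t^*$ is strictly convex.
Since $\Phi_t^*$ is Legendre, Lemma 11.5 applies, and we can invert $\nabla\Phi_t^*$ to obtain

\[
\nabla\Phi_t(\mathbf{u}) = A_t^{-1}\Bigl(\mathbf{u} + \sum_{s=1}^{t} y_s\mathbf{x}_s\Bigr).
\]

This last equation immediately yields a closed-form expression for $\mathbf{w}_t$:

\[
\mathbf{w}_t = \nabla\Phi_t(\mathbf{0}) = A_t^{-1}\sum_{s=1}^{t} y_s\mathbf{x}_s.
\]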

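In this form, $\mathbf{w}_t$ can be recognized as the solution of

\[
\operatorname*{argmin}_{\mathbf{u} \in \mathbb{R}^d}\Bigl[\frac{1}{2}\|\mathbf{u}\|^2 + \frac{1}{2}\sum_{s=1}^{t}\bigl(\mathbf{u}^\top\mathbf{x}_s - y_s\bigr)^2\Bigr] = \operatorname*{argmin}_{\mathbf{u} \in \mathbb{R}^d}\Phi_t^*(\mathbf{u}),
\]
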
which defines the well-known ridge regression estimator of Hoerl and Kennard [162]. As 
noted in Remark 11.1, of Section 11.6, this alternative nonrecursive definition of w, is 
possible under any time-varying potential. However, notwithstanding this equivalence, it 
is the recursive definition of w, that makes it easy to prove regret bounds as witnessed by 
Theorem 11.6. 

After this short digression, we now return to the analysis of the regret. Applying the 
above formulas to $\mathbf{w}_t = \nabla\Phi_t(\nabla\Phi_{t-1}^*(\mathbf{w}_{t-1}))$, and using a little algebra, we obtain
$\mathbf{w}_t = A_t^{-1}(A_{t-1}\mathbf{w}_{t-1} + y_t\mathbf{x}_t)$. We now state (proof left as an exercise) a more explicit form for
the update rule. In the rest of this chapter, we abbreviate the name of the gradient-based 
forecaster using the time-varying elliptic potential with the more concise “ridge regression 
forecaster.” 
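To make the two descriptions of the ridge regression forecaster concrete, the following sketch computes the weights both from the closed form $\mathbf{w}_t = A_t^{-1}\sum_{s \le t} y_s\mathbf{x}_s$ and from the recursion $\mathbf{w}_t = A_t^{-1}(A_{t-1}\mathbf{w}_{t-1} + y_t\mathbf{x}_t)$, and checks that they agree on random data. The helper names (`solve`, `matvec`, `ridge_weights`) and the synthetic data are illustrative assumptions; a plain Gauss-Jordan solver is used only to keep the example dependency-free.

```python
import random

def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= f * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def ridge_weights(data, d):
    """Yield (closed_form_w_t, recursive_w_t) for t = 1, 2, ..."""
    A = [[float(i == j) for j in range(d)] for i in range(d)]   # A_0 = I
    a = [0.0] * d                                               # sum of y_s x_s
    w = [0.0] * d                                               # w_0 = 0
    for x, y in data:
        Aw_prev = matvec(A, w)                                  # A_{t-1} w_{t-1}
        for i in range(d):
            for j in range(d):
                A[i][j] += x[i] * x[j]                          # A_t = A_{t-1} + x_t x_t^T
        a = [ai + y * xi for ai, xi in zip(a, x)]
        closed = solve(A, a)                                    # A_t^{-1} sum_s y_s x_s
        w = solve(A, [p + y * xi for p, xi in zip(Aw_prev, x)]) # A_t^{-1}(A_{t-1} w_{t-1} + y_t x_t)
        yield closed, w

random.seed(1)
d = 3
data = [([random.gauss(0, 1) for _ in range(d)], random.gauss(0, 1)) for _ in range(20)]
for closed, recursive in ridge_weights(data, d):
    assert all(abs(c - r) < 1e-9 for c, r in zip(closed, recursive))
```

Solving a linear system at every round costs more than necessary; it is done here only for transparency.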
The next lemma, used in the analysis of ridge regression, shows that the weight update
rule of this forecaster can be written as an instance of the Widrow–Hoff rule (mentioned in
Section 11.4) using a real matrix (instead of a scalar) as learning rate.

Lemma 11.10. The weights generated by the ridge regression forecaster satisfy, for each
$t \ge 1$,

\[
\mathbf{w}_t = \mathbf{w}_{t-1} - A_t^{-1}\bigl(\mathbf{w}_{t-1}^\top\mathbf{x}_t - y_t\bigr)\mathbf{x}_t.
\]

Before proving the main result of this section, we need a further technical lemma. 


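Lemma 11.11. Let $B$ be an arbitrary $d \times d$ full-rank matrix, let $\mathbf{x}$ be an arbitrary vector, and
let $A = B + \mathbf{x}\mathbf{x}^\top$. Then

\[
\mathbf{x}^\top A^{-1}\mathbf{x} = 1 - \frac{\det(B)}{\det(A)}.
\]

Proof. If $\mathbf{x} = (0, \ldots, 0)$, then the lemma holds trivially. Otherwise, we write

\[
B = A - \mathbf{x}\mathbf{x}^\top = A\bigl(I - A^{-1}\mathbf{x}\mathbf{x}^\top\bigr).
\]

Hence, computing the determinant of the leftmost and rightmost matrices,

\[
\det(B) = \det(A)\det\bigl(I - A^{-1}\mathbf{x}\mathbf{x}^\top\bigr).
\]

The right-hand side of this equation can be transformed as follows:

\[
\begin{aligned}
\det(A)\det\bigl(I - A^{-1}\mathbf{x}\mathbf{x}^\top\bigr)
&= \det(A)\det\bigl(A^{1/2}\bigr)\det\bigl(I - A^{-1}\mathbf{x}\mathbf{x}^\top\bigr)\det\bigl(A^{-1/2}\bigr) \\
&= \det(A)\det\bigl(I - A^{-1/2}\mathbf{x}\mathbf{x}^\top A^{-1/2}\bigr).
\end{aligned}
\]

Hence, we are left to show that $\det(I - A^{-1/2}\mathbf{x}\mathbf{x}^\top A^{-1/2}) = 1 - \mathbf{x}^\top A^{-1}\mathbf{x}$. Letting
$\mathbf{z} = A^{-1/2}\mathbf{x}$, this can be rewritten as $\det(I - \mathbf{z}\mathbf{z}^\top) = 1 - \mathbf{z}^\top\mathbf{z}$. It is easy to see that $\mathbf{z}$ is an
eigenvector of $I - \mathbf{z}\mathbf{z}^\top$ with eigenvalue $\lambda_1 = 1 - \mathbf{z}^\top\mathbf{z}$. Moreover, the remaining $d - 1$
eigenvectors $\mathbf{u}_2, \ldots, \mathbf{u}_d$ of $I - \mathbf{z}\mathbf{z}^\top$ form an orthogonal basis of the subspace of $\mathbb{R}^d$ orthogonal
to $\mathbf{z}$, and the corresponding eigenvalues $\lambda_2, \ldots, \lambda_d$ are all equal to 1. Hence,

\[
\det\bigl(I - \mathbf{z}\mathbf{z}^\top\bigr) = \prod_{j=1}^{d}\lambda_j = 1 - \mathbf{z}^\top\mathbf{z},
\]

which concludes the proof. □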
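The identity of Lemma 11.11, read here as $\mathbf{x}^\top A^{-1}\mathbf{x} = 1 - \det(B)/\det(A)$ with $A = B + \mathbf{x}\mathbf{x}^\top$, can be checked numerically in dimension 2 with explicit formulas. In this sketch the helper names and the choice $B = I + \mathbf{z}\mathbf{z}^\top$ (which guarantees a symmetric positive definite $B$) are illustrative assumptions.

```python
import random

def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def inv2(M):
    dt = det2(M)
    return [[M[1][1] / dt, -M[0][1] / dt], [-M[1][0] / dt, M[0][0] / dt]]

def quad_form(M, x):
    """Return x^T M x for a 2x2 matrix M."""
    return sum(x[i] * M[i][j] * x[j] for i in range(2) for j in range(2))

random.seed(2)
for _ in range(1000):
    z = [random.gauss(0, 1), random.gauss(0, 1)]
    B = [[float(i == j) + z[i] * z[j] for j in range(2)] for i in range(2)]
    x = [random.gauss(0, 1), random.gauss(0, 1)]
    A = [[B[i][j] + x[i] * x[j] for j in range(2)] for i in range(2)]
    lhs = quad_form(inv2(A), x)            # x^T A^{-1} x
    rhs = 1.0 - det2(B) / det2(A)          # 1 - det(B)/det(A)
    assert abs(lhs - rhs) < 1e-9
```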


We are now ready to prove a bound on the regret for the square loss of the ridge regression 
forecaster. 


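Theorem 11.7. If the ridge regression forecaster is run on a sequence
$(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots \in \mathbb{R}^d \times \mathbb{R}$, then, for all $\mathbf{u} \in \mathbb{R}^d$ and for all $n \ge 1$, the regret $R_n(\mathbf{u})$
defined in terms of the square loss satisfies

\[
R_n(\mathbf{u}) \le \frac{1}{2}\|\mathbf{u}\|^2 + \Bigl(\sum_{i=1}^{d}\ln(1 + \lambda_i)\Bigr)\max_{t=1,\ldots,n}\ell_t(\mathbf{w}_{t-1}),
\]

where $\lambda_1, \ldots, \lambda_d$ are the eigenvalues of the matrix $\mathbf{x}_1\mathbf{x}_1^\top + \cdots + \mathbf{x}_n\mathbf{x}_n^\top$.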
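Proof. Note that $\nabla\Phi_0^*(\mathbf{0}) = \nabla\frac{1}{2}\|\mathbf{0}\|^2 = \mathbf{0}$. Hence, Theorem 11.6 can be applied. Using
the nonnegativity of Bregman divergences, we get

\[
R_n(\mathbf{u}) \le D_{\Phi_0^*}(\mathbf{u}, \mathbf{w}_0) + \sum_{t=1}^{n} D_{\Phi_t^*}(\mathbf{w}_{t-1}, \mathbf{w}_t),
\]

where $D_{\Phi_0^*}(\mathbf{u}, \mathbf{w}_0) = \frac{1}{2}\|\mathbf{u}\|^2$ because $\mathbf{w}_0 = \mathbf{0}$.
Using Lemmas 11.9 and 11.10, we can write

\[
\begin{aligned}
D_{\Phi_t^*}(\mathbf{w}_{t-1}, \mathbf{w}_t)
&= \frac{1}{2}(\mathbf{w}_{t-1} - \mathbf{w}_t)^\top A_t(\mathbf{w}_{t-1} - \mathbf{w}_t) \\
&= \frac{1}{2}(\mathbf{w}_{t-1} - \mathbf{w}_t)^\top\bigl(\mathbf{w}_{t-1}^\top\mathbf{x}_t - y_t\bigr)\mathbf{x}_t \\
&= \frac{1}{2}\bigl(\mathbf{w}_{t-1}^\top\mathbf{x}_t - y_t\bigr)^2\,\mathbf{x}_t^\top A_t^{-1}\mathbf{x}_t \\
&= \ell_t(\mathbf{w}_{t-1})\,\mathbf{x}_t^\top A_t^{-1}\mathbf{x}_t.
\end{aligned}
\]

Applying Lemma 11.11, we get

\[
\begin{aligned}
\sum_{t=1}^{n}\mathbf{x}_t^\top A_t^{-1}\mathbf{x}_t
&= \sum_{t=1}^{n}\Bigl(1 - \frac{\det(A_{t-1})}{\det(A_t)}\Bigr) \\
&\le \sum_{t=1}^{n}\ln\frac{\det(A_t)}{\det(A_{t-1})} && \text{(because $1 - x \le -\ln x$ for all $x > 0$)} \\
&= \ln\frac{\det(A_n)}{\det(A_0)} \\
&= \sum_{i=1}^{d}\ln(1 + \lambda_i),
\end{aligned}
\]

where the last equality holds because $\det(A_0) = \det(I) = 1$ and because
$\det(A_n) = (1 + \lambda_1) \times \cdots \times (1 + \lambda_d)$, where $\lambda_1, \ldots, \lambda_d$ are the eigenvalues of the $d \times d$ matrix
$A_n - I = \mathbf{x}_1\mathbf{x}_1^\top + \cdots + \mathbf{x}_n\mathbf{x}_n^\top$. Hence,

\[
\sum_{t=1}^{n}\ell_t(\mathbf{w}_{t-1})\,\mathbf{x}_t^\top A_t^{-1}\mathbf{x}_t \le \Bigl(\sum_{i=1}^{d}\ln(1 + \lambda_i)\Bigr)\max_{t=1,\ldots,n}\ell_t(\mathbf{w}_{t-1});
\]

this concludes the proof. □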


Theorem 11.7 is somewhat disappointing because we do not know how to control the
term $\max_t \ell_t(\mathbf{w}_{t-1})$. If this term were bounded by a constant (which is certainly the case if the
pairs $(\mathbf{x}_t, y_t)$ come from a bounded subset of $\mathbb{R}^d \times \mathbb{R}$), then the regret would be bounded by
$O(\ln n)$, an exponential improvement over the forecasters using fixed potentials! To see this,
choose $X$ such that $\|\mathbf{x}_t\| \le X$ for all $t = 1, \ldots, n$. A basic algebraic fact states that $A_n - I$
has the same nonzero eigenvalues as the matrix $G$ with entries $G_{i,j} = \mathbf{x}_i^\top\mathbf{x}_j$ ($G$ is called the
Gram matrix of the points $\mathbf{x}_1, \ldots, \mathbf{x}_n$). Therefore
$\lambda_1 + \cdots + \lambda_d = \mathbf{x}_1^\top\mathbf{x}_1 + \cdots + \mathbf{x}_n^\top\mathbf{x}_n \le nX^2$.
The quantity $(1 + \lambda_1) \times \cdots \times (1 + \lambda_d)$, under the constraint $\lambda_1 + \cdots + \lambda_d \le nX^2$,
is maximized when $\lambda_i = nX^2/d$ for each $i$. This gives

\[
\sum_{i=1}^{d}\ln(1 + \lambda_i) \le d\ln\Bigl(1 + \frac{nX^2}{d}\Bigr).
\]
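As an illustration of Theorem 11.7 and of the logarithmic bound just derived, the sketch below runs the ridge regression forecaster in dimension $d = 2$ on synthetic data and checks both $R_n(\mathbf{u}) \le \frac{1}{2}\|\mathbf{u}\|^2 + \ln\det(A_n)\,\max_t \ell_t(\mathbf{w}_{t-1})$ and $\ln\det(A_n) \le d\ln(1 + nX^2/d)$, using $\sum_i \ln(1 + \lambda_i) = \ln\det(A_n)$. The data-generating model, the comparison vector `u_star`, and all function names are assumptions made for the example.

```python
import math
import random

def inv2(M):
    """Inverse of a 2x2 matrix given as nested lists."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det], [-M[1][0] / det, M[0][0] / det]]

def ridge_regret_terms(data, u):
    """Run the ridge forecaster (d = 2); return (regret vs u, ln det A_n, max_t loss)."""
    A = [[1.0, 0.0], [0.0, 1.0]]          # A_0 = I
    a = [0.0, 0.0]                        # running sum of y_s x_s
    forecaster_loss = expert_loss = max_loss = 0.0
    for x, y in data:
        Ainv = inv2(A)
        w = [Ainv[0][0] * a[0] + Ainv[0][1] * a[1],
             Ainv[1][0] * a[0] + Ainv[1][1] * a[1]]           # w_{t-1}
        loss = 0.5 * (w[0] * x[0] + w[1] * x[1] - y) ** 2
        forecaster_loss += loss
        max_loss = max(max_loss, loss)
        expert_loss += 0.5 * (u[0] * x[0] + u[1] * x[1] - y) ** 2
        for i in range(2):
            a[i] += y * x[i]
            for j in range(2):
                A[i][j] += x[i] * x[j]                        # A_t = A_{t-1} + x_t x_t^T
    log_det = math.log(A[0][0] * A[1][1] - A[0][1] * A[1][0])
    return forecaster_loss - expert_loss, log_det, max_loss

random.seed(3)
n, d = 500, 2
u_star = [0.7, -0.3]                      # hypothetical comparison vector
data = []
for _ in range(n):
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    y = u_star[0] * x[0] + u_star[1] * x[1] + random.gauss(0, 0.1)
    data.append((x, y))

regret, log_det, max_loss = ridge_regret_terms(data, u_star)
X2 = max(x[0] ** 2 + x[1] ** 2 for x, _ in data)
assert regret <= 0.5 * sum(c * c for c in u_star) + log_det * max_loss + 1e-9
assert log_det <= d * math.log(1 + n * X2 / d) + 1e-9
```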


11.8 A Nonlinear Forecaster 


In this section we show a variant of the ridge regression forecaster achieving a logarithmic 
regret bound with an improved leading constant. In Section 11.9 we show that this constant 
is optimal. 

We start from the nonrecursive definition, given in Section 11.7, of the gradient-based 
forecaster using the time-varying elliptic potential (i.e., the ridge regression forecaster) 


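\[
\mathbf{w}_{t-1} = \operatorname*{argmin}_{\mathbf{u} \in \mathbb{R}^d}\Bigl[\frac{1}{2}\|\mathbf{u}\|^2 + \frac{1}{2}\sum_{s=1}^{t-1}\bigl(\mathbf{u}^\top\mathbf{x}_s - y_s\bigr)^2\Bigr] = \operatorname*{argmin}_{\mathbf{u} \in \mathbb{R}^d}\Phi_{t-1}^*(\mathbf{u}).
\]

We now turn to the Vovk–Azoury–Warmuth forecaster, introduced by Vovk [300] as an
extension to linear experts of his aggregating forecaster (see Sections 3.5 and 11.10), and
also studied by Azoury and Warmuth [20] as a special case of a different algorithm. The
Vovk–Azoury–Warmuth forecaster predicts at time $t$ with $\hat{\mathbf{w}}_t^\top\mathbf{x}_t$, where

\[
\hat{\mathbf{w}}_t = \operatorname*{argmin}_{\mathbf{u} \in \mathbb{R}^d}\Bigl[\frac{1}{2}\|\mathbf{u}\|^2 + \frac{1}{2}\sum_{s=1}^{t-1}\bigl(\mathbf{u}^\top\mathbf{x}_s - y_s\bigr)^2 + \frac{1}{2}\bigl(\mathbf{u}^\top\mathbf{x}_t\bigr)^2\Bigr].
\]

Note that now the weight $\hat{\mathbf{w}}_t$ used for the prediction on $\mathbf{x}_t$ has index $t$ rather than $t - 1$. We use this,
along with the “hat” notation, to stress the fact that now $\hat{\mathbf{w}}_t$ does depend on $\mathbf{x}_t$. Note that
this makes the Vovk–Azoury–Warmuth forecaster nonlinear.

Comparing this choice of $\hat{\mathbf{w}}_t$ with the corresponding choice of $\mathbf{w}_{t-1}$ for the gradient-based
forecaster, one can note that we simply added the term $\frac{1}{2}(\mathbf{u}^\top\mathbf{x}_t)^2$. This additional
term can be viewed as the loss $\frac{1}{2}(\mathbf{u}^\top\mathbf{x}_t - y_t)^2$, where the outcome $y_t$, unavailable when $\hat{\mathbf{w}}_t$
is computed, has been “estimated” by 0.

As we did in Section 11.7, it is easy to derive an explicit form for this new $\hat{\mathbf{w}}_t$ also:

\[
\hat{\mathbf{w}}_t = A_t^{-1}\sum_{s=1}^{t-1} y_s\mathbf{x}_s.
\]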

We now prove a logarithmic bound on the regret for the square loss of the Vovk–Azoury–Warmuth
forecaster. Though its proof heavily relies on the techniques developed for proving
Theorems 11.6 and 11.7, it is not clear how to derive the bound directly as a corollary of 
those results. On the other hand, the same regret bound can be obtained (see the proof 
in Azoury and Warmuth [20]) using the gradient-based linear forecaster with a modified 
time-varying potential, and then adapting the proof of Theorem 11.6 in Section 11.6. We 
do not follow that route in order to keep the proof as simple as possible. 
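Theorem 11.8. If the Vovk–Azoury–Warmuth forecaster is run on a sequence
$(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots \in \mathbb{R}^d \times \mathbb{R}$, then, for all $\mathbf{u} \in \mathbb{R}^d$ and for all $n \ge 1$,

\[
R_n(\mathbf{u}) \le \frac{1}{2}\|\mathbf{u}\|^2 + \frac{Y^2}{2}\sum_{i=1}^{d}\ln(1 + \lambda_i) \le \frac{1}{2}\|\mathbf{u}\|^2 + \frac{dY^2}{2}\ln\Bigl(1 + \frac{nX^2}{d}\Bigr),
\]

where $X = \max_{t=1,\ldots,n}\|\mathbf{x}_t\|$, $Y = \max_{t=1,\ldots,n}|y_t|$, and $\lambda_1, \ldots, \lambda_d$ are the eigenvalues of the
matrix $\mathbf{x}_1\mathbf{x}_1^\top + \cdots + \mathbf{x}_n\mathbf{x}_n^\top$.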


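Proof. For convenience, we introduce the shorthand

\[
\mathbf{a}_t = \sum_{s=1}^{t} y_s\mathbf{x}_s.
\]

In what follows, we use $\hat{\mathbf{w}}_t = A_t^{-1}\mathbf{a}_{t-1}$ to denote the weight, at time $t$, of the
Vovk–Azoury–Warmuth forecaster, and $\mathbf{w}_{t-1} = A_{t-1}^{-1}\mathbf{a}_{t-1}$ to denote the weight, at time $t$, of the
ridge regression forecaster. A key step in the proof is the observation that, for all $\mathbf{u} \in \mathbb{R}^d$,

\[
L_n(\mathbf{u}) \ge \inf_{\mathbf{v} \in \mathbb{R}^d}\Phi_n^*(\mathbf{v}) - \frac{1}{2}\|\mathbf{u}\|^2 = \Phi_n^*(\mathbf{w}_n) - \frac{1}{2}\|\mathbf{u}\|^2,
\]

where we recall that $L_n(\mathbf{u}) = \ell_1(\mathbf{u}) + \cdots + \ell_n(\mathbf{u})$. Hence, the loss of an arbitrary linear
expert is simply bounded by the potential of the weight of the ridge regression forecaster.
This can be exploited as follows:

\[
\begin{aligned}
R_n(\mathbf{u}) &= \hat{L}_n - L_n(\mathbf{u}) \\
&\le \hat{L}_n + \frac{1}{2}\|\mathbf{u}\|^2 - \Phi_n^*(\mathbf{w}_n) \\
&= \frac{1}{2}\|\mathbf{u}\|^2 + \sum_{t=1}^{n}\bigl(\ell_t(\hat{\mathbf{w}}_t) + \Phi_{t-1}^*(\mathbf{w}_{t-1}) - \Phi_t^*(\mathbf{w}_t)\bigr) \\
&= \frac{1}{2}\|\mathbf{u}\|^2 + \sum_{t=1}^{n}\bigl(\ell_t(\hat{\mathbf{w}}_t) - \ell_t(\mathbf{w}_{t-1})\bigr) + \sum_{t=1}^{n} D_{\Phi_t^*}(\mathbf{w}_{t-1}, \mathbf{w}_t),
\end{aligned}
\]

where in the last step we used the equality

\[
\ell_t(\mathbf{w}_{t-1}) = D_{\Phi_t^*}(\mathbf{w}_{t-1}, \mathbf{w}_t) + \Phi_t^*(\mathbf{w}_t) - \Phi_{t-1}^*(\mathbf{w}_{t-1})
\]

established at the beginning of the proof of Theorem 11.6. Note that we upper bounded the
regret $R_n(\mathbf{u})$ with the difference

\[
\sum_{t=1}^{n}\bigl(\ell_t(\hat{\mathbf{w}}_t) - \ell_t(\mathbf{w}_{t-1})\bigr)
\]

between the Vovk–Azoury–Warmuth forecaster and the ridge regression forecaster plus a
sum of divergence terms.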
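Now, the identities $A_t - A_{t-1} = \mathbf{x}_t\mathbf{x}_t^\top$, $A_{t-1}^{-1} - A_t^{-1} = A_{t-1}^{-1}\mathbf{x}_t\mathbf{x}_t^\top A_t^{-1}$, and
$A_{t-1}^{-1} - A_t^{-1} = A_t^{-1}\mathbf{x}_t\mathbf{x}_t^\top A_{t-1}^{-1}$, together imply that

\[
A_{t-1}^{-1} - A_t^{-1} = A_t^{-1}\mathbf{x}_t\mathbf{x}_t^\top A_t^{-1} + \bigl(\mathbf{x}_t^\top A_{t-1}^{-1}\mathbf{x}_t\bigr) A_t^{-1}\mathbf{x}_t\mathbf{x}_t^\top A_t^{-1}.
\]

Using this and the definition of the time-varying elliptic potential (see Section 11.7),

\[
\Phi_t^*(\mathbf{w}_t) = \frac{1}{2}\mathbf{w}_t^\top A_t\mathbf{w}_t - \mathbf{w}_t^\top\mathbf{a}_t + \frac{1}{2}\sum_{s=1}^{t} y_s^2,
\]

we prove that

\[
\begin{aligned}
\ell_t(\hat{\mathbf{w}}_t) - \ell_t(\mathbf{w}_{t-1}) + D_{\Phi_t^*}(\mathbf{w}_{t-1}, \mathbf{w}_t)
&= \frac{y_t^2}{2}\,\mathbf{x}_t^\top A_t^{-1}\mathbf{x}_t - \frac{1}{2}\bigl(\mathbf{x}_t^\top A_{t-1}^{-1}\mathbf{x}_t\bigr)\bigl(\hat{\mathbf{w}}_t^\top\mathbf{x}_t\bigr)^2 \\
&\le \frac{Y^2}{2}\,\mathbf{x}_t^\top A_t^{-1}\mathbf{x}_t,
\end{aligned}
\]

where the term dropped in the last step is nonpositive (recall that $A_{t-1}$ is positive definite, implying
that $A_{t-1}^{-1}$ also is positive definite). Using Lemma 11.11 then gives the desired result. □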
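For scalar side information ($d = 1$) the two forecasters compared in this proof are easy to simulate, since $A_t = 1 + \sum_{s \le t} x_s^2$ is a number. The sketch below runs both, and checks the bound of Theorem 11.8 in the form $\hat{L}_n - \inf_u (L_n(u) + \frac{1}{2}u^2) \le \frac{Y^2}{2}\ln A_n$, where the infimum equals $\frac{1}{2}\sum_t y_t^2 - a_n^2/(2A_n)$ in one dimension. Function names and the uniform synthetic data are assumptions made for illustration.

```python
import math
import random

def vaw_and_ridge_1d(data):
    """Run both forecasters on scalar inputs; return (vaw_loss, ridge_loss, A_n, a_n)."""
    A, a = 1.0, 0.0                       # A_0 = 1, a_0 = 0
    vaw_loss = ridge_loss = 0.0
    for x, y in data:
        ridge_pred = (a / A) * x          # w_{t-1} x_t, with w_{t-1} = a_{t-1} / A_{t-1}
        A += x * x                        # A_t = A_{t-1} + x_t^2
        vaw_pred = (a / A) * x            # hat{w}_t x_t, with hat{w}_t = a_{t-1} / A_t
        vaw_loss += 0.5 * (vaw_pred - y) ** 2
        ridge_loss += 0.5 * (ridge_pred - y) ** 2
        a += y * x                        # a_t = a_{t-1} + y_t x_t
    return vaw_loss, ridge_loss, A, a

random.seed(5)
data = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(1000)]
vaw_loss, ridge_loss, A_n, a_n = vaw_and_ridge_1d(data)
Y = max(abs(y) for _, y in data)
# inf_u (L_n(u) + u^2 / 2), attained at u = a_n / A_n
best_regularized = 0.5 * sum(y * y for _, y in data) - a_n ** 2 / (2 * A_n)
assert vaw_loss - best_regularized <= 0.5 * Y ** 2 * math.log(A_n) + 1e-9
```

The only difference between the two predictions is whether the current $x_t^2$ is already included in the denominator, which shrinks the Vovk–Azoury–Warmuth prediction toward 0.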


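As a final remark, note that, using the Sherman–Morrison formula,

\[
A_t^{-1} = A_{t-1}^{-1} - \frac{\bigl(A_{t-1}^{-1}\mathbf{x}_t\bigr)\bigl(A_{t-1}^{-1}\mathbf{x}_t\bigr)^\top}{1 + \mathbf{x}_t^\top A_{t-1}^{-1}\mathbf{x}_t},
\]

the $d \times d$ matrix $A_t^{-1}$, where $A_t = A_{t-1} + \mathbf{x}_t\mathbf{x}_t^\top$, can be computed from $A_{t-1}^{-1}$ in time
$\Theta(d^2)$. This is much better than the $\Theta(d^3)$ required by a direct inversion of $A_t$. Hence, both the
ridge regression update and the Vovk–Azoury–Warmuth update can be computed in time
$\Theta(d^2)$. In contrast, the time needed to compute the forecasters based on fixed potentials is
only $\Theta(d)$.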
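A minimal sketch of the rank-one inverse update described above, assuming the inverse is stored as a nested list and that $A_{t-1}$ is symmetric (as it is for $A_t = I + \sum_s \mathbf{x}_s\mathbf{x}_s^\top$). The function names are illustrative; the final check multiplies the tracked inverse by the explicitly accumulated matrix.

```python
import random

def sherman_morrison_update(Ainv, x):
    """Return (A + x x^T)^{-1} given Ainv = A^{-1} for symmetric A, in O(d^2) time."""
    d = len(x)
    Ax = [sum(Ainv[i][j] * x[j] for j in range(d)) for i in range(d)]   # A^{-1} x
    denom = 1.0 + sum(x[i] * Ax[i] for i in range(d))                   # 1 + x^T A^{-1} x
    return [[Ainv[i][j] - Ax[i] * Ax[j] / denom for j in range(d)] for i in range(d)]

def matmul(P, Q):
    n = len(P)
    return [[sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

random.seed(4)
d = 4
A = [[float(i == j) for j in range(d)] for i in range(d)]   # A_0 = I
Ainv = [row[:] for row in A]                                # A_0^{-1} = I
for _ in range(25):
    x = [random.gauss(0, 1) for _ in range(d)]
    for i in range(d):
        for j in range(d):
            A[i][j] += x[i] * x[j]                          # explicit A_t
    Ainv = sherman_morrison_update(Ainv, x)                 # tracked A_t^{-1}

P = matmul(A, Ainv)
assert all(abs(P[i][j] - float(i == j)) < 1e-8 for i in range(d) for j in range(d))
```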
We now prove that the Vovk–Azoury–Warmuth forecaster is optimal in the sense that its
leading constant cannot be decreased.

Theorem 11.9. Let $\ell$ be the square loss $\ell(\hat{p}, y) = \frac{1}{2}(\hat{p} - y)^2$. For all $d \ge 1$, for all $Y > 0$,
for all $\varepsilon > 0$, and for any forecaster, there exists a sequence $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots \in \mathbb{R}^d \times [-Y, Y]$,
with $\|\mathbf{x}_t\| = 1$ for $t = 1, 2, \ldots$, such that

\[
\liminf_{n \to \infty}\frac{\hat{L}_n - \inf_{\mathbf{u} \in \mathbb{R}^d}\bigl(L_n(\mathbf{u}) + \|\mathbf{u}\|^2\bigr)}{\ln n} \ge \frac{Y^2}{2}(d - \varepsilon).
\]

Proof. We only prove the theorem in the case $d = 1$; the easy generalization to an arbitrary
$d > 1$ is left as an exercise. Without loss of generality, set $Y = 1/2$ (the bound can be rescaled
to any range). Let $x_t = 1$ for $t = 1, 2, \ldots$, so that all losses are of the form $\frac{1}{2}(w - y)^2$.
Since this value does not change if the same constant is added to $w$ and $y$, without loss
of generality we may assume that $y_t \in \{0, 1\}$ for all $t$ (instead of $y_t \in \{-1/2, 1/2\}$, as
suggested by the assumption $Y = 1/2$).
Let $L(F, y^n)$ be the cumulative square loss of forecaster $F$ on the sequence
$(1, y_1), \ldots, (1, y_n)$ and let $L(u, y^n)$ be the cumulative square loss of expert $u$ on the same
sequence. Then

\[
\inf_F\max_{y^n}\Bigl(L(F, y^n) - \inf_u\bigl(L(u, y^n) + u^2\bigr)\Bigr)
\ge \inf_F\mathbb{E}\Bigl[L(F, Y_1, \ldots, Y_n) - \inf_u\bigl(L(u, Y_1, \ldots, Y_n) + u^2\bigr)\Bigr],
\]

where the expectation is taken with respect to a probability distribution on $\{0, 1\}^n$ defined
as follows: first, $Z \in [0, 1]$ is drawn from the Beta distribution with parameters $(a, a)$,
where $a > 1$ is specified later. Then each $Y_t$ is drawn independently from a Bernoulli
distribution of parameter $Z$. It is easy to show that the forecaster $F$ achieving the infimum
of $\mathbb{E}\,L(F, Y_1, \ldots, Y_n)$ predicts at time $t + 1$ by minimizing the expected loss on the next
outcome $Y_{t+1}$ conditioned on the realizations of the previous outcomes $Y_1, \ldots, Y_t$. The
prediction $\hat{p}_{t+1}$ minimizing the expected square loss on $Y_{t+1}$ is simply the expected value
of $Y_{t+1}$ conditioned on the number of times $S_t = Y_1 + \cdots + Y_t$ the outcome 1 showed up
in the past. Using simple properties of the Beta distribution (see Section A.1.9), we find
that the conditional density $f_Z(p \mid S_t = k)$ of $Z$, given the event $S_t = k$, equals

\[
f_Z(p \mid S_t = k) = \frac{p^{a+k-1}(1 - p)^{a+t-k-1}}{B(a + k, a + t - k)},
\]

where $B(x, y) = \Gamma(x)\Gamma(y)/\Gamma(x + y)$ is the Beta function. Therefore,

\[
\begin{aligned}
\hat{p}_{t+1} = \mathbb{E}[Y_{t+1} \mid S_t = k]
&= \int_0^1\mathbb{E}[Y_{t+1} \mid Z = p]\,f_Z(p \mid S_t = k)\,dp \\
&= \int_0^1 p\,\frac{p^{a+k-1}(1 - p)^{a+t-k-1}}{B(a + k, a + t - k)}\,dp \\
&= \frac{k + a}{t + 2a}.
\end{aligned}
\]


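Let $\mathbb{E}_p$ be the conditional expectation $\mathbb{E}[\,\cdot \mid Z = p]$. Then,

\[
\begin{aligned}
\mathbb{E}_p\bigl[(\hat{p}_{t+1} - Y_{t+1})^2\bigr]
&= \mathbb{E}_p\bigl[(\hat{p}_{t+1} - p)^2\bigr] + \mathbb{E}_p\bigl[(Y_{t+1} - p)^2\bigr] \\
&= \mathbb{E}_p\bigl[(\hat{p}_{t+1} - p)^2\bigr] + p(1 - p) \\
&= \mathbb{E}_p\Bigl[\Bigl(\frac{S_t + a}{t + 2a} - \frac{pt + a}{t + 2a} + \frac{pt + a}{t + 2a} - p\Bigr)^2\Bigr] + p(1 - p) \\
&= \mathbb{E}_p\Bigl[\Bigl(\frac{S_t - pt}{t + 2a}\Bigr)^2\Bigr] + \Bigl(\frac{2ap - a}{t + 2a}\Bigr)^2 + p(1 - p) && \text{(since $\mathbb{E}_p S_t = pt$)} \\
&= \frac{tp(1 - p)}{(t + 2a)^2} + \Bigl(\frac{2ap - a}{t + 2a}\Bigr)^2 + p(1 - p).
\end{aligned}
\]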
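Hence, recalling that $a > 1$, the expected cumulative loss of the optimal forecaster $F$ is
computed as

\[
\begin{aligned}
\mathbb{E}\,L(F, Y_1, \ldots, Y_n)
&= \frac{1}{2}\mathbb{E}[Z(1 - Z)]\sum_{t=0}^{n-1}\frac{t}{(t + 2a)^2} + \frac{1}{2}\mathbb{E}\bigl[(2aZ - a)^2\bigr]\sum_{t=0}^{n-1}\frac{1}{(t + 2a)^2} + \frac{1}{2}\sum_{t=0}^{n-1}\mathbb{E}[Z(1 - Z)] \\
&\ge \frac{1}{2}\mathbb{E}[Z(1 - Z)]\int_1^{n-1}\frac{t\,dt}{(t + 2a)^2} + \frac{1}{2}\mathbb{E}\bigl[(2aZ - a)^2\bigr]\int_1^{n-1}\frac{dt}{(t + 2a)^2} + \frac{n}{2}\mathbb{E}[Z(1 - Z)].
\end{aligned}
\]

We now lower bound the three terms on the right-hand side. The integral in the first term
equals

\[
\ln\frac{n - 1 + 2a}{1 + 2a} - \frac{2a(n - 2)}{(1 + 2a)(n - 1 + 2a)} = \ln\frac{n - 1 + 2a}{1 + 2a} + \Theta(1),
\]

where, here and in what follows, $\Theta(1)$ is understood for $a$ constant and $n \to \infty$. As the
entire second term is $\Theta(1)$, using $\mathbb{E}[Z(1 - Z)] = a/(4a + 2)$, we get

\[
\mathbb{E}\,L(F, Y_1, \ldots, Y_n) = \frac{a}{4(2a + 1)}\ln\frac{n - 1 + 2a}{1 + 2a} + \frac{an}{4(2a + 1)} + \Theta(1).
\]

Now we compute the cumulative loss of the best expert. Recalling that $S_n = Y_1 + \cdots + Y_n$,
we have

\[
\mathbb{E}_p\Bigl[\inf_u\sum_{t=1}^{n}(u - Y_t)^2\Bigr]
= \mathbb{E}_p\Bigl[\sum_{t=1}^{n}\Bigl(\frac{S_n}{n} - Y_t\Bigr)^2\Bigr]
= \mathbb{E}_p\Bigl[\sum_{t=1}^{n} Y_t - \frac{2}{n}S_n^2 + \frac{1}{n}S_n^2\Bigr].
\]

Hence, recalling that the variance of any random variable $X$ is $\mathbb{E}X^2 - (\mathbb{E}X)^2$, and that the
variance of the Binomial random variable $S_n$ is $np(1 - p)$, we get

\[
\begin{aligned}
\mathbb{E}_p\Bigl[\inf_u\sum_{t=1}^{n}(u - Y_t)^2\Bigr]
&= np - \frac{2}{n}\bigl(np(1 - p) + (np)^2\bigr) + \frac{1}{n}\bigl(np(1 - p) + (np)^2\bigr) \\
&= np(1 - p) - p(1 - p).
\end{aligned}
\]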
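Integrating over $p$ yields

\[
\mathbb{E}\Bigl[\inf_u\frac{1}{2}\sum_{t=1}^{n}(u - Y_t)^2\Bigr] = \frac{n - 1}{2}\,\mathbb{E}[Z(1 - Z)] = \frac{an}{4(2a + 1)} - \frac{a}{4(2a + 1)}.
\]

Therefore, as $0 \le u \le 1$ for the minimizing $u = S_n/n$,

\[
\begin{aligned}
\mathbb{E}\Bigl[L(F, Y_1, \ldots, Y_n) - \inf_u\bigl(L(u, Y_1, \ldots, Y_n) + u^2\bigr)\Bigr]
&= \frac{a}{4(2a + 1)}\ln\frac{n - 1 + 2a}{1 + 2a} + \frac{an}{4(2a + 1)} - \frac{an}{4(2a + 1)} + \Theta(1) \\
&= \Bigl(1 - \frac{1}{2a + 1}\Bigr)\frac{1}{8}\ln\frac{n - 1 + 2a}{1 + 2a} + \Theta(1).
\end{aligned}
\]

We conclude the proof by observing that the factor $1 - \frac{1}{2a+1}$ multiplying $\frac{1}{8}\ln n$ can be made
arbitrarily close to 1 by choosing $a$ large enough; note that $\frac{1}{8} = Y^2/2$ for $Y = 1/2$. □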


11.10 Mixture Forecasters 


We have seen that in the case of linear experts, regret bounds growing logarithmically with 
time are obtainable for the squared loss function. The purpose of this section is to show that 
by a natural extension of the mixture forecasters introduced in Chapter 9, one may obtain a 
class of algorithms that achieve logarithmic regret bounds under general conditions for the 
logarithmic loss function. 

We start by describing a general framework for linear prediction under the logarithmic 
loss function. The basic setup is reminiscent of sequential probability assignment intro- 
duced in Chapter 9. The model is extended to allow side information and experts that 
depend on “squashed” linear functions of the side-information vector. To harmonize notation
with Chapter 9, consider the following model. At each time instance $t$, before making
a prediction, the forecaster observes the side-information vector $\mathbf{x}_t$ such that $\|\mathbf{x}_t\| \le 1$. The
forecaster, based on the side-information vector and the past, assigns a nonnegative number
$\hat{p}_t(y, \mathbf{x}_t)$ to each element $y$ of the outcome space $\mathcal{Y}$. Note that, as opposed to the rest of the
section, we do not require that $\mathcal{Y}$ be a subset of the real line. The function $\hat{p}_t(\cdot, \mathbf{x}_t)$ is sometimes
interpreted as a “density” over $\mathcal{Y}$, though $\mathcal{Y}$ does not even need to be a measurable
space. The loss at time $t$ of the forecaster is defined by the logarithmic loss $-\ln\hat{p}_t(y_t, \mathbf{x}_t)$,
and the corresponding cumulative loss is

\[
\hat{L}_n = -\ln\prod_{t=1}^{n}\hat{p}_t(y_t, \mathbf{x}_t).
\]

Just as in earlier sections in this chapter, each expert is indexed by a vector $\mathbf{u} \in \mathbb{R}^d$, and
its prediction depends on the inner product of $\mathbf{u}$ and the side-information vector through
a transfer function. Here the transfer function $\sigma : \mathcal{Y} \times \mathbb{R} \to \mathbb{R}$ is nonnegative, and the
prediction of expert $\mathbf{u}$, on observing the side information $\mathbf{x}_t$, is the “density” $\sigma(\cdot, \mathbf{u} \cdot \mathbf{x}_t)$.
The cumulative loss of expert $\mathbf{u}$ is thus

\[
L_n(\mathbf{u}) = -\ln\prod_{t=1}^{n}\sigma(y_t, \mathbf{u} \cdot \mathbf{x}_t).
\]
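In this section we consider mixture forecasters of the form

\[
\hat{p}_t(y, \mathbf{x}_t) = \int\sigma(y, \mathbf{u} \cdot \mathbf{x}_t)\,q_t(\mathbf{u})\,d\mathbf{u},
\]

where $q_t$ is a density function (i.e., a nonnegative function with integral 1) over $\mathbb{R}^d$ defined,
for $t = 1, 2, \ldots$, by

\[
q_t(\mathbf{u}) = \frac{q_0(\mathbf{u})\,e^{-L_{t-1}(\mathbf{u})}}{\int q_0(\mathbf{v})\,e^{-L_{t-1}(\mathbf{v})}\,d\mathbf{v}}
= \frac{q_0(\mathbf{u})\prod_{s=1}^{t-1}\sigma(y_s, \mathbf{u} \cdot \mathbf{x}_s)}{\int q_0(\mathbf{v})\prod_{s=1}^{t-1}\sigma(y_s, \mathbf{v} \cdot \mathbf{x}_s)\,d\mathbf{v}},
\]

where $q_0$ denotes a fixed initial density. Thus, $\hat{p}_t$ is the prediction of the exponentially
weighted average forecaster run with initial weights given by the “prior” density $q_0$.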


Example 11.8 (Square loss). Assume now that $\mathcal{Y}$ is a subset of the real line. By considering
the “gaussian” transfer function

\[
\sigma(y, \mathbf{u} \cdot \mathbf{x}) = \frac{1}{\sqrt{2\pi}}\,e^{-(\mathbf{u} \cdot \mathbf{x} - y)^2/2},
\]

the logarithmic loss of expert $\mathbf{u}$ becomes $\ln\sqrt{2\pi} + \frac{1}{2}(\mathbf{u} \cdot \mathbf{x}_t - y_t)^2$, which is basically
the square loss studied in earlier sections. However, the logarithmic loss of the mixture
forecaster $\hat{p}_t(y_t, \mathbf{x}_t)$ does not always correspond to the squared loss of any vector-valued
forecaster $\mathbf{w}_t$. If the initial density $q_0$ is the multivariate gaussian density with identity
covariance matrix, then it is possible to modify the mixture predictor such that it becomes
equivalent to the Vovk–Azoury–Warmuth forecaster (see Exercise 11.18).


The main result of this section is the following general performance bound. 


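Theorem 11.10. Assume that the transfer function $\sigma$ is such that, for each fixed $y \in \mathcal{Y}$, the
function $F_y(z) = -\ln\sigma(y, z)$, defined for $z \in \mathbb{R}$, is twice continuously differentiable, and
there exists a constant $c$ such that $|F_y''(z)| \le c$ for all $z \in \mathbb{R}$. Let $\mathbf{u} \in \mathbb{R}^d$, $\varepsilon > 0$, and let $q_{\mathbf{u}}^\varepsilon$
be any density over $\mathbb{R}^d$ with mean $\int\mathbf{v}\,q_{\mathbf{u}}^\varepsilon(\mathbf{v})\,d\mathbf{v} = \mathbf{u}$ and covariance matrix $\varepsilon^2 I$. Then the
regret of the mixture forecaster defined above, with respect to expert $\mathbf{u}$, is bounded as

\[
\hat{L}_n - L_n(\mathbf{u}) \le \frac{nc\varepsilon^2}{2} + D\bigl(q_{\mathbf{u}}^\varepsilon\,\|\,q_0\bigr),
\]

where

\[
D\bigl(q_{\mathbf{u}}^\varepsilon\,\|\,q_0\bigr) = \int q_{\mathbf{u}}^\varepsilon(\mathbf{v})\ln\frac{q_{\mathbf{u}}^\varepsilon(\mathbf{v})}{q_0(\mathbf{v})}\,d\mathbf{v}
\]

is the Kullback–Leibler divergence between $q_{\mathbf{u}}^\varepsilon$ and the initial density $q_0$.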

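Proof. Fix $\mathbf{u} \in \mathbb{R}^d$, and let $q_{\mathbf{u}}^\varepsilon$ be any density with mean $\mathbf{u}$ and covariance matrix $\varepsilon^2 I$. In
the first step of the proof we relate the cumulative loss $\hat{L}_n$ to the averaged cumulative loss

\[
L_n(q_{\mathbf{u}}^\varepsilon) = \int L_n(\mathbf{v})\,q_{\mathbf{u}}^\varepsilon(\mathbf{v})\,d\mathbf{v}.
\]

First observe that by Taylor's theorem and the condition on the transfer function, for any
$y \in \mathcal{Y}$ and $z, z_0 \in \mathbb{R}$,

\[
F_y(z) \le F_y(z_0) + F_y'(z_0)(z - z_0) + \frac{c}{2}(z - z_0)^2.
\]

Next we apply this inequality for $z = \mathbf{V} \cdot \mathbf{x}_t$ and $z_0 = \mathbb{E}[\mathbf{V}] \cdot \mathbf{x}_t$, where $\mathbf{V}$ is a random
variable distributed according to the density $q_{\mathbf{u}}^\varepsilon$. Noting that $z_0 = \mathbf{u} \cdot \mathbf{x}_t$, and taking expected
values on both sides, we have

\[
\mathbb{E}\,F_{y_t}(\mathbf{V} \cdot \mathbf{x}_t) \le F_{y_t}(\mathbf{u} \cdot \mathbf{x}_t) + \frac{c}{2}\varepsilon^2,
\]

where we used the fact that $\operatorname{var}(\mathbf{x}_t \cdot \mathbf{V}) = \sum_{i=1}^{d}\operatorname{var}(V_i)\,x_{t,i}^2 = \varepsilon^2\sum_{i=1}^{d} x_{t,i}^2 \le \varepsilon^2$. Observing
that

\[
\sum_{t=1}^{n} F_{y_t}(\mathbf{u} \cdot \mathbf{x}_t) = L_n(\mathbf{u}) \qquad \text{and} \qquad \sum_{t=1}^{n}\mathbb{E}\,F_{y_t}(\mathbf{V} \cdot \mathbf{x}_t) = L_n(q_{\mathbf{u}}^\varepsilon),
\]

we obtain

\[
L_n(q_{\mathbf{u}}^\varepsilon) \le L_n(\mathbf{u}) + \frac{nc\varepsilon^2}{2}.
\]

To finish the proof, it remains to compare $\hat{L}_n$ with $L_n(q_{\mathbf{u}}^\varepsilon)$. By definition of the mixture
forecaster,

\[
\begin{aligned}
\hat{L}_n - L_n(q_{\mathbf{u}}^\varepsilon)
&= -\ln\prod_{t=1}^{n}\hat{p}_t(y_t, \mathbf{x}_t) + \int q_{\mathbf{u}}^\varepsilon(\mathbf{v})\ln\prod_{t=1}^{n}\sigma(y_t, \mathbf{v} \cdot \mathbf{x}_t)\,d\mathbf{v} \\
&= \int q_{\mathbf{u}}^\varepsilon(\mathbf{v})\ln\frac{\prod_{t=1}^{n}\sigma(y_t, \mathbf{v} \cdot \mathbf{x}_t)}{\prod_{t=1}^{n}\hat{p}_t(y_t, \mathbf{x}_t)}\,d\mathbf{v} \\
&= \int q_{\mathbf{u}}^\varepsilon(\mathbf{v})\ln\frac{\prod_{t=1}^{n}\sigma(y_t, \mathbf{v} \cdot \mathbf{x}_t)}{\int q_0(\mathbf{w})\prod_{t=1}^{n}\sigma(y_t, \mathbf{w} \cdot \mathbf{x}_t)\,d\mathbf{w}}\,d\mathbf{v} \\
&= \int q_{\mathbf{u}}^\varepsilon(\mathbf{v})\ln\frac{q_{n+1}(\mathbf{v})}{q_0(\mathbf{v})}\,d\mathbf{v} && \text{(by definition of $q_{n+1}$)} \\
&= \int q_{\mathbf{u}}^\varepsilon(\mathbf{v})\ln\frac{q_{\mathbf{u}}^\varepsilon(\mathbf{v})}{q_0(\mathbf{v})}\,d\mathbf{v} - \int q_{\mathbf{u}}^\varepsilon(\mathbf{v})\ln\frac{q_{\mathbf{u}}^\varepsilon(\mathbf{v})}{q_{n+1}(\mathbf{v})}\,d\mathbf{v} \\
&= D\bigl(q_{\mathbf{u}}^\varepsilon\,\|\,q_0\bigr) - D\bigl(q_{\mathbf{u}}^\varepsilon\,\|\,q_{n+1}\bigr) \\
&\le D\bigl(q_{\mathbf{u}}^\varepsilon\,\|\,q_0\bigr),
\end{aligned}
\]

where at the last step we used the nonnegativity of the Kullback–Leibler divergence (see
Section A.2). This concludes the proof of the theorem. □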


Remark 11.3. The boundedness condition of the second derivative of the logarithm of
the transfer function is satisfied in several natural applications. For example, for the gaussian
transfer function $\sigma(y, z) = (1/\sqrt{2\pi})\,e^{-(z - y)^2/2}$, we have $F_y''(z) = 1$ for all $y, z \in \mathbb{R}$.
Another popular transfer function is the one used in logistic regression. In this case
$\mathcal{Y} = \{-1, 1\}$, and $\sigma$ is defined by $\sigma(y, \mathbf{u} \cdot \mathbf{x}) = 1/(1 + e^{-y\,\mathbf{u} \cdot \mathbf{x}})$. It is easy to see that
$|F_y''(z)| \le 1$ for both $y = -1, 1$.
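The bound on the second derivative for the logistic transfer function can be checked with a finite-difference sketch. It assumes the reading $\sigma(y, z) = 1/(1 + e^{-yz})$, so that $F_y(z) = \ln(1 + e^{-yz})$ and $F_y''(z) = s(z)(1 - s(z))$ with $s(z) = 1/(1 + e^{-z})$, which never exceeds $1/4$; the function names and the grid of test points are illustrative.

```python
import math

def F(y, z):
    """F_y(z) = -ln sigma(y, z) for the logistic transfer function."""
    return math.log1p(math.exp(-y * z))

def second_derivative(y, z, h=1e-3):
    """Central finite-difference approximation of F_y''(z)."""
    return (F(y, z + h) - 2.0 * F(y, z) + F(y, z - h)) / h ** 2

for y in (-1, 1):
    for i in range(-50, 51):
        z = i / 5.0
        s = 1.0 / (1.0 + math.exp(-z))
        # Closed form s(1 - s), the same for y = -1 and y = 1
        assert abs(second_derivative(y, z) - s * (1.0 - s)) < 1e-5
        assert abs(second_derivative(y, z)) <= 1.0
```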
Note that the definition of the mixture forecaster does not depend on the choice of $q_{\mathbf{u}}^\varepsilon$, so
that we are free to choose this density to minimize the obtained upper bound. Minimization of
the Kullback–Leibler divergence given the variance constraint is a complex variational
problem, in general. However, useful upper bounds can easily be derived in various special
cases. Next we work out a specific example in which the variational problem can be solved
easily and $q_{\mathbf{u}}^\varepsilon$ can be chosen optimally. See the exercises for other examples.


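Corollary 11.1. Assume that the mixture forecaster is used with the gaussian initial density
$q_0(\mathbf{u}) = (2\pi)^{-d/2}e^{-\|\mathbf{u}\|^2/2}$. If the transfer function satisfies the conditions of Theorem 11.10,
then for any $\mathbf{u} \in \mathbb{R}^d$,

\[
\hat{L}_n - L_n(\mathbf{u}) \le \frac{\|\mathbf{u}\|^2}{2} + \frac{d}{2}\ln\Bigl(1 + \frac{nc}{d}\Bigr).
\]

Proof. The Kullback–Leibler divergence between $q_0$ and any density $q_{\mathbf{u}}^\varepsilon$ with mean $\mathbf{u}$ and
covariance matrix $\varepsilon^2 I$ equals

\[
\begin{aligned}
D\bigl(q_{\mathbf{u}}^\varepsilon\,\|\,q_0\bigr)
&= \int q_{\mathbf{u}}^\varepsilon(\mathbf{v})\ln q_{\mathbf{u}}^\varepsilon(\mathbf{v})\,d\mathbf{v} + \int q_{\mathbf{u}}^\varepsilon(\mathbf{v})\ln\frac{1}{q_0(\mathbf{v})}\,d\mathbf{v} \\
&= \int q_{\mathbf{u}}^\varepsilon(\mathbf{v})\ln q_{\mathbf{u}}^\varepsilon(\mathbf{v})\,d\mathbf{v} + \frac{d}{2}\ln(2\pi) + \frac{\|\mathbf{u}\|^2}{2} + \frac{d\varepsilon^2}{2}.
\end{aligned}
\]

The first term on the right-hand side is the negative of the differential entropy of
the density $q_{\mathbf{u}}^\varepsilon(\mathbf{v})$. It is easy to see (Exercise 11.19) that among all densities with a given
covariance matrix, the differential entropy is maximized for the gaussian density. Therefore,
the best choice for $q_{\mathbf{u}}^\varepsilon(\mathbf{v})$ is the multivariate normal density with mean $\mathbf{u}$ and covariance
matrix $\varepsilon^2 I$. With this choice,

\[
\int q_{\mathbf{u}}^\varepsilon(\mathbf{v})\ln q_{\mathbf{u}}^\varepsilon(\mathbf{v})\,d\mathbf{v} = -\frac{d}{2}\ln\bigl(2\pi e\varepsilon^2\bigr)
\]

and the statement follows by choosing $\varepsilon$ to minimize the obtained bound. □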
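The final minimization over $\varepsilon$ is short enough to spell out. The following worked derivation is a sketch that combines the bound of Theorem 11.10 with the gaussian choice of $q_{\mathbf{u}}^\varepsilon$, writing $s = \varepsilon^2$ (a shorthand introduced here only for the computation).

```latex
% Bound of Theorem 11.10 with q_u^eps = N(u, eps^2 I) and q_0 = N(0, I), where s = eps^2:
\[
g(s) = \frac{ncs}{2} + \frac{\|\mathbf{u}\|^2}{2} + \frac{ds}{2} - \frac{d}{2} - \frac{d}{2}\ln s .
\]
% Setting the derivative to zero:
\[
g'(s) = \frac{nc + d}{2} - \frac{d}{2s} = 0
\quad\Longrightarrow\quad
s^\star = \frac{d}{nc + d} .
\]
% Substituting back, the terms (nc + d) s^*/2 and -d/2 cancel:
\[
g(s^\star) = \frac{\|\mathbf{u}\|^2}{2} - \frac{d}{2}\ln\frac{d}{nc + d}
           = \frac{\|\mathbf{u}\|^2}{2} + \frac{d}{2}\ln\Bigl(1 + \frac{nc}{d}\Bigr) .
\]
```

Since $g$ is convex in $s$, the stationary point is the minimizer, which gives the bound stated in the corollary.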


11.11 Bibliographic Remarks 


Sequential gradient descent can be viewed as an application of the well-known stochastic 
gradient descent procedure of Tsypkin [291] (see Bottou and Murata [39] for a survey) 
to a deterministic (rather than stochastic) data sequence. The gradient-based linear fore- 
caster was introduced with the name general additive regression algorithm by Warmuth 
and Jagota [305] (see also Kivinen and Warmuth [183]) for regression problems, and 
independently with the name quasi-additive classification algorithm by Grove, Littlestone, 
and Schuurmans [133] for classification problems. Potentials in pattern recognition have 
been introduced to describe, in a single unified framework, seemingly different algorithms 
such as the Widrow–Hoff rule [310] (weights updated additively) and the exponentiated
gradient (EG) of Kivinen and Warmuth [181] (weights updated multiplicatively). The frame- 
work of potential functions enables one to view both algorithms as instances of a single 
algorithm, the gradient-based linear forecaster, whose weights are updated as in the dual 
gradient update. In particular, the Widrow–Hoff rule corresponds to the gradient-based
linear forecaster applied to square loss and using the quadratic potential (Legendre poly- 
nomial potential with p = 2), while the EG algorithm corresponds to the forecaster of 
Theorem 11.3. As it has been observed by Grove, Littlestone, and Schuurmans [133], the 
polynomial potential provides a parameterized interpolation between genuinely additive 
algorithms and multiplicative algorithms. Earlier individual sequence analyses of additive 
and multiplicative algorithms for linear experts appear in Foster [103], Littlestone, Long, 
and Warmuth [202], Cesa—Bianchi, Long, and Warmuth [50], and Bylander [43]. For an 
extensive discussion on the advantages of using polynomial vs. exponential potentials in 
regression problems, see Kivinen and Warmuth [181]. 

Gordon [131] develops an analysis of regret for more general problems than regression 
based on a generalized notion of Bregman divergence. 

The interpretation of the gradient-based update in terms of iterated minimization of 
a convex functional was suggested by Helmbold, Schapire, Singer, and Warmuth [157]. 
However, the connection with convex optimization is far from being accidental. An analog 
of the dual gradient update rule was introduced by Nemirovski and Yudin [223] under 
the name of mirror descent algorithm for the iterative solution of nonsmooth convex 
optimization problems. The description of this algorithm in the framework of Bregman 
divergences is due to Beck and Teboulle [24], who also propose a version of the algorithm 
based on the exponential potential. In the context of convex optimization, the iterated 
minimization of the functional 


$$\min_{u \in \mathbb{R}^d} \bigl[ D_{\Phi^*}(u, w_{t-1}) + \lambda\, \ell_t(u) \bigr]$$


(which we used to motivate the dual gradient update) corresponds to the well-known 
proximal point algorithm (see, e.g., Martinet [210] and Rockafellar [248]). A version of the 
proximal point algorithm based on the exponential potential was proposed by Tseng and 
Bertsekas [290]. 

Theorem 11.1 is due to Warmuth and Jagota [305]. The use of transfer functions in this 
context was pioneered by Helmbold, Kivinen, and Warmuth [153]. The good properties of 
subquadratic pairs of transfer and loss functions were observed by Cesa-Bianchi [46]. A dif- 
ferent approach, based on the notion “matching loss functions,” uses Bregman divergences 
to build nice pairs of transfer and loss functions and was investigated in Haussler, Kivinen, 
and Warmuth [153]. In spite of its elegance, the matching-loss approach is not discussed 
here because it does not fit very well with the proof techniques used in this chapter. 

The projected gradient-based forecaster of Section 11.5 has been introduced by Herbster 
and Warmuth [160]. They prove Theorem 11.4 about tracking linear experts. The self- 
confident forecaster has been introduced by Auer, Cesa-Bianchi, and Gentile [13], who 
also prove Theorem 11.5. 

The time varying potential for gradient-based forecasters of Section 11.6 was introduced 
and studied by Azoury and Warmuth [20], who also proved Theorem 11.6. The forecaster 
based on the elliptic potential, also analyzed in [20], corresponds to the well-known ridge 
regression algorithm (see Hoerl and Kennard [162]). An early analysis of the least-squares 
forecaster in the case where $y_t = u^\top x_t + \varepsilon_t$, where $u \in \mathbb{R}^d$ is an unknown target vector and
$\varepsilon_t$ are i.i.d. random variables with finite variance, is due to Lai, Robbins, and Wei [190].
Lemma 11.11 is due to Lai and Wei [191]. Theorem 11.7 is proven by Vovk in [300]. A
different proof of the same result is given by Azoury and Warmuth in [20]. The proof
presented here uses ideas from Forster [102] and [20].

The derivation and analysis of the nonlinear forecaster in Section 11.8 is taken from 
Azoury and Warmuth [20], who introduced it as the “forward algorithm.” In [300], Vovk 
derives exactly the same forecaster generalizing to continuously many experts the aggre- 
gating forecaster described in Section 3.5, where the initial weights assigned to the linear 
experts are gaussian. Using different (and somewhat more complex techniques) Vovk also 
proves the same bound as the one proven in Theorem 11.8. The first logarithmic regret 
bound for the square loss with linear experts is due to Foster [103], who analyzes a variant 
of the ridge regression forecaster in the more specific setup where outcomes are binary, the
side-information elements $x_t$ belong to $[0, 1]^d$, and the linear experts u belong to the
probability simplex in $\mathbb{R}^d$. The lower bound in Section 11.9 on prediction with linear experts
and square loss is due to Vovk [300] (see also Singer, Kozat, and Feder [268] for a stronger 
result). 

Mixture forecasters in the spirit of Section 11.10 were considered by Vovk [299, 300]. 
Vovk’s aggregating forecaster is, in fact, a mixture forecaster and the Vovk—Azoury— 
Warmuth forecaster is obtained via a generalization of the aggregating forecaster. Theo- 
rem 11.10 was proved by Kakade and Ng [173]. A logarithmic regret bound may also be 
derived by a variation on a result of Yamanishi [314], who proved a general logarithmic 
bound for all mixable losses and for general parametric classes of experts using the aggre- 
gating forecaster. However, the resulting forecaster is not computationally efficient. It is 
worth pointing out that the mixture forecasters in Section 11.10 are formally equivalent 
to predictive bayesian mixtures. In fact, if $q_0$ is the prior density, the mixture predictors
are obtained by bayesian updating. Such predictors have been thoroughly studied in the 
bayesian literature under the assumption that the sequence of outcomes is generated by one 
of the models (see, e.g., Clarke and Barron [62,63]). The choice of prior has been studied 
in the individual sequence framework by Clarke and Dawid [64]. 


11.12 Exercises 


11.1 Prove Lemma 11.1. 
11.2 Prove that Lemma 11.3 holds with equality whenever S is a hyperplane. 


11.3 Consider the modified gradient-based linear forecaster whose weight w, at time ¢ is the solution 
of the equation 


w = argmin| De» (u, W;_1) +A e,(u) |. 


ucRi 


Note that this amounts to not taking the linear approximation of £, (u) around w,—1, as done in 
the original characterization of the gradient-based linear forecaster expressed as solution of a 
convex optimization problem. 

Prove that a solution to this equation always exists and then adapt the proof of Theorem 11.1 
to prove a bound on the regret of the modified forecaster. 


11.4 Prove that the weight update rule

$$w_{i,t} = \frac{w_{i,t-1}\, e^{-\lambda \nabla \ell_t(w_{t-1})_i}}{\sum_{j=1}^{d} w_{j,t-1}\, e^{-\lambda \nabla \ell_t(w_{t-1})_j}}$$

corresponds to a Bregman projection of $w'_{i,t} = w_{i,t-1}\, e^{-\lambda \nabla \ell_t(w_{t-1})_i}$, i = 1, ..., d, onto the
probability simplex in $\mathbb{R}^d$, where the projection is taken according to the Legendre dual

$$\Phi^*(u) = \sum_{i=1}^{d} u_i (\ln u_i - 1)$$

of the potential $\Phi(u) = e^{u_1} + \cdots + e^{u_d}$.

11.5 Consider the Legendre polynomial potential $\Phi_p$. Show that if the loss function is the square
loss $\ell(p, y) = \frac{1}{2}(p - y)^2$, then for all $w, u \in \mathbb{R}^d$, for all $y \in \mathbb{R}$, and for all c > 0,

$$\frac{1}{1+c}\, \ell(w \cdot x, y) - \ell(u \cdot x, y) \le \frac{(p-1)\|x\|_p^2}{c} \bigl( D_{\Phi_q}(u, w) - D_{\Phi_q}(u, w') \bigr),$$

where $w' = \nabla\Phi_p\bigl(\nabla\Phi_q(w) - \lambda \nabla \ell(w)\bigr)$ is the dual gradient update, and
$\lambda = c / \bigl((1 + c)(p - 1)\|x\|_p^2\bigr)$ (Kivinen and Warmuth [181], Gentile [124]). Warning: This exercise is difficult.

11.6 (Continued) Use the inequality stated in Exercise 11.5 to derive a bound on the square loss
regret $R_n(u)$ for the gradient-based linear forecaster using the identity function as transfer
function. Find a value of c that yields a regret bound for the square loss slightly better than
that of Theorem 11.2.

11.7 Derive a regret bound for the gradient-based forecaster using the absolute loss $\ell(p, y) = |p - y|$.
Note that this loss is not regular as it is not differentiable at p = y (Cesa-Bianchi [46],
Long [205]).

11.8 Prove a regret bound for the projected gradient-based linear forecaster using the hyperbolic
cosine potential.


Prove Lemma 11.8. Hint: Set lọ = a and prove, for each t = 1,...,, the inequality 


Prove an analogue of Theorem 11.4 using the Legendre exponential potential and projecting 
the weights to the convex set obtained by intersecting the probability simplex in R? with the 
hypercube [6 /d, 1], where £ is a free parameter (Herbster and Warmuth [160]). 


Prove Lemma 11.9. 
Show that 


1. the nice pair of Example 11.6 is 2-subquadratic, 
2. the nice pair of Example 11.7 is 1/4-subquadratic. 


Consider the alternative definition of gradient-based forecaster using the time-varying potential

$$w_t = \operatorname*{argmin}_{u \in \mathbb{R}^d} \Phi_t^*(u),$$

where

$$\Phi_t^*(u) = D_{\Phi^*}(u, w_0) + \sum_{s=1}^{t} \ell_s(u)$$

for some initial weight $w_0$ and Legendre potential $\Phi = \Phi_0$. Using this forecaster, show a
more general version of Theorem 11.6 proving the same bound without the requirement 
V ®*(w,) = 0 for all £ > 0 (Azoury and Warmuth [20]). 

Prove Lemma 11.10. 

(Follow the best expert) Consider the online linear prediction problem with the square loss
$\ell(w_{t-1} \cdot x_t, y_t) = (w_{t-1} \cdot x_t - y_t)^2$. Consider the follow-the-best-expert (or least-squares) forecaster
with weights

$$w_t = \operatorname*{argmin}_{u \in \mathbb{R}^d} \sum_{s=1}^{t} (u \cdot x_s - y_s)^2.$$

Show that

$$w_t = \Bigl( \sum_{s=1}^{t} x_s x_s^\top \Bigr)^{-1} \sum_{s=1}^{t} y_s x_s$$

whenever $x_1 x_1^\top + \cdots + x_t x_t^\top$ is invertible. Use the analysis in Section 3.2 to derive logarithmic
regret bounds for this forecaster when $y_t \in [-1, 1]$. What conditions do you need for the $x_t$?


Provide a proof of Theorem 11.9 in the general case d > 1. 


By adapting the proof of Theorem 11.9, show a lower bound for the relative entropy loss in 
the univariate case d = 1. (Yamanishi [314].) 

Consider the mixture forecaster P, (-, x,) of Section 11.10 with the gaussian transfer function. 
Assume that the gaussian initial density $q_0(u) = (2\pi)^{-d/2} e^{-\|u\|^2/2}$ is used. Assume that Y =
[—1, 1]. Show that the forecaster w, defined by 


_ BAH1,) -PU x) 
4 
is just the Vovk—Azoury—Warmuth forecaster (Vovk [300]). 


t 


Show that for any multivariate density q on $\mathbb{R}^d$ with zero mean $\int u\, q(u)\, du = 0$ and covariance
matrix K, the differential entropy $h(q) = -\int q(u) \ln q(u)\, du$ satisfies

$$h(q) \le \frac{d}{2} \ln(2\pi e) + \frac{1}{2} \ln \det(K),$$

and equality is achieved by the multivariate normal density with covariance matrix K (see,
e.g., Cover and Thomas [74]).


Consider the mixture predictor in Section 11.10 and choose the initial density qo to be uniform 
in a cube [—B, B]¢. Show that for any vector u with |lul|,, < B — Gd/nc)!”, the regret 


satisfies 
Tornoe nee Pai 
n` Cn Ss 7m nb. 
ee Be ag 


Hint: Choose the auxiliary density q to be uniform on a cube centered at u.


Linear Classification


12.1 The Zero—One Loss 


An important special case of linear prediction with side information (Chapter 11) is the
problem of binary pattern classification, where the decision space D and the outcome space
Y are both equal to $\{-1, 1\}$. To predict an outcome $y_t \in \{-1, 1\}$, given the side information
$x_t \in \mathbb{R}^d$, the forecaster uses the linear classification $\hat{y}_t = \operatorname{sgn}(w_{t-1} \cdot x_t)$, where $w_{t-1}$ is a
weight vector and sgn(·) is the sign function. In the entire chapter we use the terminology
and notation introduced in Chapter 11.

A natural loss function in the framework of classification is the zero-one loss
$\ell(\hat{y}, y) = \mathbb{I}_{\{\hat{y} \neq y\}}$, counting the number of classification mistakes $\hat{y} \neq y$. Since this loss function is
not convex, we cannot analyze forecasters in this model using the machinery developed in
Chapter 11. A possibility, which we investigated in Chapter 4 for arbitrary losses, is to allow
the forecaster to randomize his predictions. As the expected zero-one loss is equivalent to
the absolute loss $\frac{1}{2}|y - p|$, where $y \in \{-1, 1\}$ and $p \in [-1, +1]$, we see that randomization
provides a convex variant of the original problem, which we can study using the techniques
of Chapter 11. In this chapter we show that, even in the case of deterministic predictions,
meaningful zero-one loss bounds can be derived by twisting the analysis for convex
losses.

Let $\hat{p}$ be a real-valued prediction used to determine the forecast $\hat{y}$ by $\hat{y} = \operatorname{sgn}(\hat{p})$, and
then consider a regular (and thus convex) loss function $\ell$ such that $\mathbb{I}_{\{\hat{y} \neq y\}} \le \ell(\hat{p}, y)$ for
all $\hat{p} \in \mathbb{R}$ and $y \in \{-1, 1\}$. Now take any forecaster for linear experts, such as one of the
gradient-based forecasters in Chapter 11. The techniques developed in Chapter 11 allow
one to derive bounds for the regret $\hat{L}_n - L_n(u)$, where

$$\hat{L}_n = \sum_{t=1}^{n} \ell(\hat{p}_t, y_t) \quad \text{and} \quad L_n(u) = \sum_{t=1}^{n} \ell(u \cdot x_t, y_t).$$

Since $\mathbb{I}_{\{\hat{y}_t \neq y_t\}} \le \ell(\hat{p}_t, y_t)$, we immediately obtain a bound on the “regret”

$$\sum_{t=1}^{n} \mathbb{I}_{\{\hat{y}_t \neq y_t\}} - L_n(u)$$

of the forecaster using classifications of the form $\hat{y}_t = \operatorname{sgn}(\hat{p}_t)$. Note that this notion of
regret evaluates the performance of the linear reference predictor u with a loss function
larger than the one used to evaluate the forecaster $\hat{y}$. This discrepancy is inherent to our
analysis, which is largely based on convexity arguments.

Figure 12.1. A plot of the square loss $\ell(\hat{p}, y) = (\hat{p} - y)^2$ for y = 1. This loss upper bounds the
zero-one loss. The “normalized” hinge loss $(1 - y\hat{p}/\gamma)_+$ is another convex upper bound on the
zero-one loss (see Section 12.2). The plot shows $(1 - \hat{p}/\gamma)_+$ for $\gamma = 3/2$.

An example of the results we can obtain using such an argument is the following.
Consider the square loss $\ell(\hat{p}, y) = (\hat{p} - y)^2$. This loss is regular and upper bounds the
zero-one loss (see Figure 12.1). If we predict each binary outcome $y_t$ using $\hat{y}_t = \operatorname{sgn}(\hat{p}_t)$,
where $\hat{p}_t$ is the Vovk-Azoury-Warmuth forecaster (see Section 11.8), then Theorem 11.8
immediately implies the bound

$$\sum_{t=1}^{n} \mathbb{I}_{\{\hat{y}_t \neq y_t\}} \le \sum_{t=1}^{n} (u \cdot x_t - y_t)^2 + \|u\|^2 + \sum_{i=1}^{d} \ln(1 + \lambda_i),$$

where $\lambda_1, \ldots, \lambda_d$ are the eigenvalues of the matrix $x_1 x_1^\top + \cdots + x_n x_n^\top$. This bound holds
for any sequence $(x_1, y_1), (x_2, y_2), \ldots \in \mathbb{R}^d \times \{-1, 1\}$ and for all $u \in \mathbb{R}^d$.

In the next sections we illustrate a more sophisticated “variational” approach to the 
analysis of forecasters for linear classification. This approach is based on the idea of finding 
a parametric family of functions that upper bound the zero—one loss and then expressing 
the regret using the best of these functions to evaluate the performance of the reference 
forecaster. 

The rest of this chapter is organized as follows. In Section 12.2 we introduce the hinge 
loss, which we use to derive mistake bounds for forecasters based on various potentials. In 
Section 12.3 we show that by modifying the forecaster based on the quadratic potential, we 
can obtain, after a finite number of weight updates, a maximum margin linear separator for 
any linearly separable data sequence. In Section 12.4 we extend the label efficient setup to 
linear classification and prove label efficient mistake bounds for several forecasters. Finally,
Section 12.5 shows that some of the linear forecasters analyzed here can perform nonlinear
classification, with a moderate computational overhead, by implicitly embedding the side 
information into a suitably chosen Hilbert space. 


12.2 The Hinge Loss 


In general, the convex upper bound on the zero-one loss yielding the tightest approximation
of the overall mistake count depends on the whole unknown sequence of side
information and outcome pairs $(x_t, y_t)$, t = 1, 2, .... In this section we introduce the hinge
loss, a parameterized approximation to the zero-one loss, and we show simple forecasters
that achieve a mistake bound that strikes an optimal tradeoff for the value of the
parameter.

Following a consolidated terminology in learning theory, we call instance any side
information x, and example any pair (x, y), where y is the label associated with x.

The hinge loss, with hinge at $\gamma > 0$, is defined by

$$\ell_\gamma(p, y) = (\gamma - p y)_+ \qquad \text{(the hinge loss)},$$

where $p \in \mathbb{R}$, $y \in \{-1, 1\}$, and $(x)_+$ denotes the positive part of x. Note that $\ell_\gamma(p, y)$ is
convex in p and $\ell_\gamma/\gamma$ is an upper bound on the zero-one loss (see Figure 12.1). Equipped
with this notion of loss, we derive bounds on the regret

$$\sum_{t=1}^{n} \mathbb{I}_{\{\hat{y}_t \neq y_t\}} - \inf_{\gamma > 0} \frac{1}{\gamma} \sum_{t=1}^{n} \ell_\gamma(u \cdot x_t, y_t)$$

that hold for an arbitrarily chosen $u \in \mathbb{R}^d$.
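
As a small numerical illustration of the definitions above, the following Python sketch evaluates the hinge loss and the regret expression on given predictions. The function names are hypothetical, the infimum over the hinge is replaced by a minimum over a finite grid supplied by the caller, and a prediction with p y ≤ 0 is counted as a mistake (so a zero score is treated as wrong); all three are assumptions made for this example only.

```python
def hinge_loss(p, y, gamma):
    """Hinge loss (gamma - p*y)_+ with hinge at gamma > 0."""
    return max(0.0, gamma - p * y)


def zero_one_loss(p, y):
    """Mistake indicator for the classification sgn(p).

    Assumption for this sketch: p*y <= 0 counts as a mistake."""
    return 1.0 if p * y <= 0 else 0.0


def hinge_regret(predictions, labels, u_scores, gammas):
    """Mistake count of the forecaster minus the smallest normalized
    cumulative hinge loss of the reference vector over a finite grid
    of hinge values (a stand-in for the infimum over gamma > 0).

    predictions: real-valued forecaster scores w_{t-1} . x_t
    u_scores:    reference scores u . x_t
    """
    mistakes = sum(zero_one_loss(p, y) for p, y in zip(predictions, labels))
    best = min(
        sum(hinge_loss(q, y, g) for q, y in zip(u_scores, labels)) / g
        for g in gammas
    )
    return mistakes - best
```

On any grid of scores one can check that `hinge_loss(p, y, gamma) / gamma` is never smaller than `zero_one_loss(p, y)`, which is the upper-bound property used throughout this section.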

We develop forecasters for classification that adopt a conservative updating policy.
This means that the current weight vector $w_{t-1}$ is updated only when $\hat{y}_t \neq y_t$. So, the
prediction of a conservative forecaster at time t only depends on the past examples
$(x_s, y_s)$, for s < t, such that $\hat{y}_s \neq y_s$. The philosophy behind this conservative policy is
that there is no reason to change the weight vector if it has worked well at the last time
instance. We focus on conservative gradient-based forecasters whose update is based on
the hinge loss. Recall from Section 11.3 that the gradient-based linear forecaster computes
predictions $\hat{p}_t = w_{t-1} \cdot x_t$, where the weights $w_{t-1}$ are updated using the dual gradient
update

$$\nabla\Phi^*(w_t) = \nabla\Phi^*(w_{t-1}) - \lambda \nabla \ell_{\gamma,t}(w_{t-1}),$$

where $\nabla \ell_{\gamma,t}(w) = \nabla \ell_\gamma(w \cdot x_t, y_t)$. Technically, $\ell_\gamma(p, y)$ is not differentiable at $p = \gamma/y$.
However, since our algorithms are conservative, the derivative of $\ell_\gamma(\cdot, y)$ is computed only
when p and y have different signs. Thus, we may set $\nabla \ell_{\gamma,t}(w) = -y_t x_t \mathbb{I}_{\{\hat{y} \neq y_t\}}$, where
$\hat{y} = \operatorname{sgn}(w \cdot x_t)$.

The conservative gradient-based forecaster for classification is spelled out below for a
Legendre potential $\Phi$. For the definition of a Legendre potential function $\Phi$ and its dual
$\Phi^*$ see Section 11.2.

THE CONSERVATIVE FORECASTER
FOR LINEAR CLASSIFICATION


Parameters: learning rate $\lambda > 0$, Legendre potential $\Phi$.
Initialization: $w_0 = \nabla\Phi(0)$.
For each round t = 1, 2, ...


(1) observe $x_t$, set $\hat{p}_t = w_{t-1} \cdot x_t$, and predict $\hat{y}_t = \operatorname{sgn}(\hat{p}_t)$;

(2) get $y_t \in \{-1, 1\}$;

(3) if $\hat{y}_t \neq y_t$, then let $w_t = \nabla\Phi(\nabla\Phi^*(w_{t-1}) + \lambda y_t x_t)$;
else let $w_t = w_{t-1}$.
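
The procedure above can be sketched in Python for one concrete choice of potential, the polynomial potential $\Phi_p(u) = \frac{1}{2}\|u\|_p^2$. This is only an illustration under stated assumptions: the function names are hypothetical, sgn(0) is taken to be −1, and instead of evaluating $\nabla\Phi^*$ explicitly the code keeps the dual vector $\theta_t = \nabla\Phi^*(w_t)$, which is equivalent because $\nabla\Phi^*$ is the inverse of $\nabla\Phi$.

```python
def grad_p_norm_potential(u, p):
    """Gradient of Phi_p(u) = 0.5 * ||u||_p^2; the zero vector maps to zero."""
    norm = sum(abs(ui) ** p for ui in u) ** (1.0 / p)
    if norm == 0.0:
        return [0.0] * len(u)
    return [
        (1.0 if ui >= 0 else -1.0) * abs(ui) ** (p - 1) / norm ** (p - 2)
        for ui in u
    ]


def conservative_forecaster(examples, p=2.0, lam=1.0):
    """Run the conservative forecaster with potential Phi_p.

    examples: list of (x, y) with x a sequence of floats and y in {-1, 1}.
    Returns the final weight vector and the number of mistakes."""
    d = len(examples[0][0])
    theta = [0.0] * d                       # dual weights, theta_t = grad Phi*(w_t)
    w = grad_p_norm_potential(theta, p)     # w_0 = grad Phi(0) = 0
    mistakes = 0
    for x, y in examples:
        p_hat = sum(wi * xi for wi, xi in zip(w, x))
        y_hat = 1 if p_hat > 0 else -1      # assumption: sgn(0) = -1
        if y_hat != y:                      # conservative: update only on mistakes
            mistakes += 1
            theta = [ti + lam * y * xi for ti, xi in zip(theta, x)]
            w = grad_p_norm_potential(theta, p)
    return w, mistakes
```

With `p=2.0` the map `grad_p_norm_potential` is the identity, so the sketch performs the additive update w + y x on each mistake.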


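A direct application of the results from Chapter 11 would not serve to prove regret bounds
where $\gamma$ is set optimally. We take a different route that can be followed in the case of
any gradient-based forecaster using the hinge loss with a conservative updating policy. A
basic inequality for such forecasters, shown in the proof of Theorem 11.1 using Taylor’s
theorem, is

$$\lambda \bigl( \ell_{\gamma,t}(w_{t-1}) - \ell_{\gamma,t}(u) \bigr) \le \lambda (u - w_{t-1}) \cdot \bigl( -\nabla \ell_{\gamma,t}(w_{t-1}) \bigr)
= D_{\Phi^*}(u, w_{t-1}) - D_{\Phi^*}(u, w_t) + D_{\Phi^*}(w_{t-1}, w_t)$$

for any $u \in \mathbb{R}^d$ and any Legendre potential $\Phi$. Now observe that, at any step t such that
$\operatorname{sgn}(\hat{p}_t) \neq y_t$, the hinge loss $\ell_{\gamma,t}(u) = (\gamma - y_t\, u \cdot x_t)_+$ obeys the inequality

$$\gamma - \ell_{\gamma,t}(u) = \gamma - (\gamma - y_t\, u \cdot x_t)_+ \le y_t\, u \cdot x_t = u \cdot \bigl( -\nabla \ell_{\gamma,t}(w_{t-1}) \bigr).$$

Therefore,

$$\lambda \bigl( \gamma - \ell_{\gamma,t}(u) \bigr) \mathbb{I}_{\{\hat{y}_t \neq y_t\}} \le \lambda\, u \cdot \bigl( -\nabla \ell_{\gamma,t}(w_{t-1}) \bigr)
\le \lambda (u - w_{t-1}) \cdot \bigl( -\nabla \ell_{\gamma,t}(w_{t-1}) \bigr)
= D_{\Phi^*}(u, w_{t-1}) - D_{\Phi^*}(u, w_t) + D_{\Phi^*}(w_{t-1}, w_t).$$

To understand the second inequality note that $\nabla \ell_{\gamma,t} \neq 0$ only when $\operatorname{sgn}(\hat{p}_t) \neq y_t$ and, in this
case, $w_{t-1} \cdot \nabla \ell_{\gamma,t}(w_{t-1}) = -y_t\, w_{t-1} \cdot x_t \ge 0$. The equality holds when $\nabla \ell_{\gamma,t} = 0$ because,
in that case, $w_{t-1} = w_t$.

The advantage of this approach is seen as follows: if we multiply both sides of

$$\lambda \bigl( \gamma - \ell_{\gamma,t}(u) \bigr) \mathbb{I}_{\{\hat{y}_t \neq y_t\}} \le \lambda\, u \cdot \bigl( -\nabla \ell_{\gamma,t}(w_{t-1}) \bigr)$$

by an arbitrary extra parameter $\alpha > 0$, and proceed as above, we obtain the new inequality

$$\alpha \lambda \bigl( \gamma - \ell_{\gamma,t}(u) \bigr) \mathbb{I}_{\{\hat{y}_t \neq y_t\}} \le D_{\Phi^*}(\alpha u, w_{t-1}) - D_{\Phi^*}(\alpha u, w_t) + D_{\Phi^*}(w_{t-1}, w_t).$$

Note that on the right-hand side u is now scaled by $\alpha$. Summing for t = 1, ..., n and using
some simple algebraic manipulation, we obtain

$$\sum_{t=1}^{n} \bigl( \alpha \lambda \gamma - D_{\Phi^*}(w_{t-1}, w_t) \bigr) \mathbb{I}_{\{\hat{y}_t \neq y_t\}} \le \alpha \lambda L_{\gamma,n}(u) + D_{\Phi^*}(\alpha u, w_0), \qquad (12.1)$$

where $L_{\gamma,n}(u) = \sum_{t=1}^{n} \ell_{\gamma,t}(u)$ is the cumulative hinge loss of u.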
This is our basic inequality for the hinge loss, and we now apply it to different potential
functions.

Polynomial Potential and the Perceptron

Consider first the conservative forecaster for classification based on the Legendre polynomial
potential $\Phi_p(u) = \frac{1}{2}\|u\|_p^2$, where $p \ge 2$. The conservative update rule for this
forecaster can be written as

$$w_t = \nabla\Phi_p \bigl( \nabla\Phi_q(w_{t-1}) + \lambda y_t x_t \mathbb{I}_{\{\hat{y}_t \neq y_t\}} \bigr).$$

It turns out that for the polynomial potential linear classification is, in some sense, easier
than the linear pattern recognition problem of Chapter 11. In particular, a constant learning
rate ($\lambda = 1$) is sufficient to obtain a good bound for the number of mistakes. Unfortunately,
as we will see, this is not the case for the exponential potential, which still needs a careful choice
of the learning rate.


Theorem 12.1. If the conservative forecaster using the Legendre polynomial potential $\Phi_p$
is run on a sequence $(x_1, y_1), (x_2, y_2), \ldots \in \mathbb{R}^d \times \{-1, 1\}$ with learning rate $\lambda = 1$, then
for all $n \ge 1$, for all $u \in \mathbb{R}^d$, and for all $\gamma > 0$,

$$\sum_{t=1}^{n} \mathbb{I}_{\{\hat{y}_t \neq y_t\}} \le \frac{L_{\gamma,n}(u)}{\gamma} + (p-1) \Bigl( \frac{X_p}{\gamma} \|u\|_q \Bigr)^2 + \sqrt{ (p-1) \Bigl( \frac{X_p}{\gamma} \|u\|_q \Bigr)^2 \frac{L_{\gamma,n}(u)}{\gamma} },$$

where $X_p = \max_{t=1,\ldots,n} \|x_t\|_p$ and $q = p/(p-1)$ is the conjugate exponent of p.


Note that this bound holds simultaneously for all $u \in \mathbb{R}^d$ and for all $\gamma > 0$. Hence, in
particular, it holds for the best possible $\gamma$ for each linear classifier u. Note also that this
bound has the same general form as the bound stated in Theorem 11.2 in which $\lambda$ is
set optimally for each choice of $\gamma$ and u. So, linear classification does not require the
self-confident tuning techniques that we used in Section 11.5.


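Proof. We start from inequality (12.1). Following the proof of Theorem 11.2, for any t
with $\hat{y}_t \neq y_t$, we upper bound $D_{\Phi^*}(w_{t-1}, w_t)$ as follows:

$$D_{\Phi^*}(w_{t-1}, w_t) \le \frac{p-1}{2} \bigl\| \nabla \ell_{\gamma,t}(w_{t-1}) \bigr\|_p^2 \le \frac{p-1}{2} X_p^2.$$

Now apply this bound to (12.1) for $u \in \mathbb{R}^d$ arbitrary and with $\lambda = 1$. This yields

$$\sum_{t=1}^{n} \Bigl( \alpha\gamma - \frac{p-1}{2} X_p^2 \Bigr) \mathbb{I}_{\{\hat{y}_t \neq y_t\}} \le \alpha L_{\gamma,n}(u) + \frac{\alpha^2}{2} \|u\|_q^2,$$

where we used the equality $D_{\Phi^*}(\alpha u, w_0) = \frac{\alpha^2}{2} \|u\|_q^2$.
Setting $\alpha = \bigl( 2\varepsilon + (p-1) X_p^2 \bigr) / (2\gamma)$ for $\varepsilon > 0$ (to be determined later), and dividing by
$\varepsilon > 0$, gives

$$\sum_{t=1}^{n} \mathbb{I}_{\{\hat{y}_t \neq y_t\}} \le \frac{L_{\gamma,n}(u)}{\gamma} + \frac{(p-1) X_p^2}{2\varepsilon} \frac{L_{\gamma,n}(u)}{\gamma} + \frac{\bigl( 2\varepsilon + (p-1) X_p^2 \bigr)^2}{4 \varepsilon \gamma^2} \frac{\|u\|_q^2}{2}.$$

To minimize the bound, set

$$\varepsilon = \frac{\gamma}{\|u\|_q} \sqrt{ (p-1) X_p^2 \frac{L_{\gamma,n}(u)}{\gamma} + \Bigl( \frac{p-1}{2} \frac{X_p^2 \|u\|_q}{\gamma} \Bigr)^2 }.$$

With an easy algebraic manipulation we then get

$$\sum_{t=1}^{n} \mathbb{I}_{\{\hat{y}_t \neq y_t\}} \le \frac{L_{\gamma,n}(u)}{\gamma} + \frac{p-1}{2} \Bigl( \frac{X_p}{\gamma} \Bigr)^2 \|u\|_q^2 + \frac{\|u\|_q}{\gamma} \sqrt{ (p-1) X_p^2 \frac{L_{\gamma,n}(u)}{\gamma} + \Bigl( \frac{p-1}{2} \frac{X_p^2 \|u\|_q}{\gamma} \Bigr)^2 }.$$

Using $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ for all a, b > 0 we get the inequality stated in the theorem.
Since u and $\gamma > 0$ were arbitrary, the proof is concluded. ∎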


The forecaster of Theorem 12.1 is also known as the p-norm Perceptron algorithm. For
p = 2, this reduces to the classical Perceptron algorithm, whose weight updating rule is
simply $w_t = w_{t-1} + y_t x_t \mathbb{I}_{\{\hat{y}_t \neq y_t\}}$. Note also that, for p = 2 and $L_{\gamma,n}(u) = 0$, the bound of
Theorem 12.1 reduces to

$$\sum_{t=1}^{n} \mathbb{I}_{\{\hat{y}_t \neq y_t\}} \le \Bigl( \frac{\max_{t=1,\ldots,n} \|x_t\| \, \|u\|}{\gamma} \Bigr)^2.$$

In this special case, the result is equivalent to the Perceptron convergence theorem (see
Section 12.6). More precisely, $L_{\gamma,n}(u) = 0$ implies that the sequence $(x_1, y_1), \ldots, (x_n, y_n)$
is linearly separated by the hyperplane u with margin at least $\gamma$. The margin of the linearly
separable data sequence with respect to the separating hyperplane u is defined by
$\min_t y_t\, u \cdot x_t / \|u\|$. The Perceptron convergence theorem states that the number of mistakes (or,
equivalently, updates) performed by the Perceptron algorithm on any linearly separable
sequence is at most the squared ratio of (1) the radius of the smallest origin-centered
euclidean ball enclosing all instances and (2) the margin $\gamma$ of any separating hyperplane u
(see Figure 12.2).
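
The mistake bound for the classical Perceptron can be checked numerically on a small example. The Python sketch below is an illustration only: the toy sequence, the separator u = (1, −1), the function names, and the convention sgn(0) = −1 are all assumptions introduced here. It runs the Perceptron (p = 2, learning rate 1) cyclically over the toy sequence and compares the number of updates with $(\max_t \|x_t\| \, \|u\| / \gamma)^2$, where $\gamma = \min_t y_t\, u \cdot x_t$.

```python
import math


def perceptron_mistakes(examples):
    """Classical Perceptron with learning rate 1; returns the number of updates."""
    w = [0.0] * len(examples[0][0])
    mistakes = 0
    for x, y in examples:
        p_hat = sum(wi * xi for wi, xi in zip(w, x))
        y_hat = 1 if p_hat > 0 else -1          # assumption: sgn(0) = -1
        if y_hat != y:
            mistakes += 1
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return mistakes


def convergence_bound(examples, u):
    """(R * ||u|| / gamma)^2 with R = max ||x_t|| and gamma = min y_t u.x_t."""
    radius = max(math.sqrt(sum(xi * xi for xi in x)) for x, _ in examples)
    gamma = min(y * sum(ui * xi for ui, xi in zip(u, x)) for x, y in examples)
    if gamma <= 0:
        raise ValueError("u does not separate the sequence")
    u_norm = math.sqrt(sum(ui * ui for ui in u))
    return (radius * u_norm / gamma) ** 2


# Hypothetical toy sequence, linearly separated by u = (1, -1).
base = [((2.0, 1.0), 1), ((1.0, 3.0), -1), ((3.0, 2.5), 1), ((0.5, 1.0), -1)]
u = (1.0, -1.0)
mistakes = perceptron_mistakes(base * 50)       # cycle through the base sequence
bound = convergence_bound(base, u)
```

Because the bound holds for every sequence separated by u, it also covers the cyclic repetition of the base sequence, so `mistakes` cannot exceed `bound` however many passes are made.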

The dynamic tuning $\lambda_t = 1/\|x_t\|$ is known to improve the empirical performance of
the Perceptron algorithm. Indeed, we can easily extend the analysis of Theorem 12.1 (see
Exercise 12.1) and prove a bound that, in the case of linearly separable sequences, can be
stated as

$$\sum_{t=1}^{n} \mathbb{I}_{\{\hat{y}_t \neq y_t\}} \le \max_{t=1,\ldots,n} \Bigl( \frac{\|x_t\| \, \|u\|}{y_t\, u \cdot x_t} \Bigr)^2,$$

where u is any linear separator. The improvement is clear when we rewrite the bound for
the Perceptron with static tuning $\lambda = 1$ as

$$\Bigl( \frac{\max_{t=1,\ldots,n} \|x_t\| \, \|u\|}{\min_{t=1,\ldots,n} y_t\, u \cdot x_t} \Bigr)^2.$$

Figure 12.2. The instances $x_t \in \mathbb{R}^2$ of a linearly separable sequence $(x_1, y_1), \ldots, (x_n, y_n)$. Empty
circles denote instances $x_t$ with label $y_t = 1$ and filled circles denote instances with label
$y_t = -1$. A separating hyperplane, passing through the origin, with margin $\gamma$ is drawn. The
Perceptron convergence theorem states that, on this sequence, the Perceptron algorithm will make at
most $(R/\gamma)^2$ mistakes or, equivalently, perform at most $(R/\gamma)^2$ updates.


Exponential Potential and Winnow 

We proceed by considering the conservative linear forecaster based on the Legendre exponential
potential $\Phi(u) = e^{u_1} + \cdots + e^{u_d}$. Unlike the analysis of the polynomial potential,
here the choice of the learning rate makes a difference. Indeed, our result does not address
the issue of tuning $\lambda$, and the regret bound we prove retains the same general form as the
bound proven in Theorem 11.3 for the linear pattern recognition problem.


Theorem 12.2. Assume the conservative forecaster using the Legendre exponential potential
$\Phi(u) = e^{u_1} + \cdots + e^{u_d}$ with normalized weights is run with learning rate $\lambda = (2\gamma\varepsilon)/X_\infty^2$,
where $0 < \varepsilon < 1$, on a sequence $(x_1, y_1), (x_2, y_2), \ldots \in \mathbb{R}^d \times \{-1, 1\}$. Then
for all $n \ge 1$ such that $X_\infty \ge \max_{t=1,\ldots,n} \|x_t\|_\infty$ and for all $u \in \mathbb{R}^d$ in the probability
simplex,

$$\sum_{t=1}^{n} \mathbb{I}_{\{\hat{y}_t \neq y_t\}} \le \frac{1}{1 - \varepsilon} \frac{L_{\gamma,n}(u)}{\gamma} + \Bigl( \frac{X_\infty}{\gamma} \Bigr)^2 \frac{\ln d}{2\varepsilon(1 - \varepsilon)}.$$

Proof. Following the proof of Theorem 11.3, we upper bound $D_{\Phi^*}(w_{t-1}, w_t)$ as follows:

$$D_{\Phi^*}(w_{t-1}, w_t) \le D_{\Phi^*}(w_{t-1}, w'_t) \le \frac{\lambda^2}{2} X_\infty^2,$$

where $w'_{i,t} = w_{i,t-1}\, e^{-\lambda \nabla \ell_{\gamma,t}(w_{t-1})_i}$ and $w_t$ is $w'_t$ normalized (as in the proof of Theorem 11.3,
we applied the generalized pythagorean inequality, Lemma 11.3, in the first step of this
derivation). Now apply this bound to (12.1) for $\alpha = 1$ and for any $u \in \mathbb{R}^d$ in the probability
simplex. This yields

$$\sum_{t=1}^{n} \Bigl( \lambda\gamma - \frac{\lambda^2}{2} X_\infty^2 \Bigr) \mathbb{I}_{\{\hat{y}_t \neq y_t\}} \le \lambda L_{\gamma,n}(u) + \ln d,$$

where we used the inequality $D_{\Phi^*}(u, w_0) \le \ln d$ for $w_0 = (1/d, \ldots, 1/d)$. Dividing both
sides by $\lambda\gamma > 0$ and substituting our choice of $\lambda$ yields the desired bound. ∎


The forecaster used in Theorem 12.2 is a “zero-threshold” variant of the Winnow algorithm.
In the linearly separable case, the bound of Theorem 12.2, with $\varepsilon = 1/2$, reduces
to

$$\sum_{t=1}^{n} \mathbb{I}_{\{\hat{y}_t \neq y_t\}} \le 2 \Bigl( \frac{\max_{t=1,\ldots,n} \|x_t\|_\infty}{\gamma} \Bigr)^2 \ln d,$$

where $\gamma$ is the margin of the hyperplane defined by u. A comparison with the corresponding
bound for the Perceptron algorithm reveals an interplay between polynomial and
exponential potential analogous to that discussed in Section 11.4.
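
The separable-case bound can be illustrated with a short Python sketch of the conservative forecaster with normalized exponential-potential weights. Everything specific in it is an assumption made for illustration: the function name, the convention sgn(0) = −1, and the toy sequence, in which instances from $\{-1, 1\}^d$ are labeled by their first coordinate so that u = (1, 0, ..., 0) lies in the simplex and separates with margin 1.

```python
import itertools
import math


def zero_threshold_winnow(examples, gamma, x_inf, eps=0.5):
    """Conservative forecaster with normalized exponential-potential weights.

    Uses the learning rate 2 * gamma * eps / x_inf**2 of Theorem 12.2 and
    returns the final weights together with the number of mistakes."""
    lam = 2.0 * gamma * eps / x_inf ** 2
    d = len(examples[0][0])
    w = [1.0 / d] * d                            # uniform initial weights
    mistakes = 0
    for x, y in examples:
        p_hat = sum(wi * xi for wi, xi in zip(w, x))
        y_hat = 1 if p_hat > 0 else -1           # assumption: sgn(0) = -1
        if y_hat != y:
            mistakes += 1
            # Multiplicative step: minus the hinge gradient equals y * x on a mistake.
            unnormalized = [wi * math.exp(lam * y * xi) for wi, xi in zip(w, x)]
            total = sum(unnormalized)
            w = [vi / total for vi in unnormalized]
    return w, mistakes


# Hypothetical toy sequence: label = first coordinate, so gamma = 1 and X_inf = 1.
d = 6
examples = [(x, x[0]) for x in itertools.product((-1, 1), repeat=d)]
w, mistakes = zero_threshold_winnow(examples, gamma=1.0, x_inf=1.0)
bound = 2.0 * (1.0 / 1.0) ** 2 * math.log(d)     # separable case with eps = 1/2
```

The bound grows only logarithmically in d, in contrast with the Perceptron bound, whose dependence on the dimension enters through euclidean norms.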


The Second-Order Perceptron 

We close this section by showing an analysis of the conservative classifier based on the
Vovk-Azoury-Warmuth forecaster (see the discussion at the end of Section 12.1). Following
the terminology introduced by Cesa-Bianchi, Conconi, and Gentile [47], we call
this classifier second-order Perceptron. At each time step t = 1, 2, ... the second-order
Perceptron predicts with $\hat{y}_t = \operatorname{sgn}(\hat{w}_t^\top x_t)$, where

$$\hat{w}_t = A_t^{-1} \sum_{s=1}^{t-1} y_s x_s \mathbb{I}_{\{\hat{y}_s \neq y_s\}} \quad \text{and} \quad A_t = a I + \sum_{s=1}^{t-1} x_s x_s^\top \mathbb{I}_{\{\hat{y}_s \neq y_s\}} + x_t x_t^\top$$

(following the notation introduced in Chapter 11, we use $\hat{w}_t$ instead of $w_{t-1}$ to denote the
weight vector of this forecaster at time t). Note that we have introduced a parameter a > 0
multiplying the identity matrix I. Even though this parameter is not used in the analysis of
the second-order Perceptron, it becomes convenient when we compare the behavior of this
algorithm with that of the standard Perceptron.

Rather than using the inequality (12.1), we follow the arguments developed in
Section 11.8 for the analysis of the Vovk-Azoury-Warmuth forecaster.
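
Before turning to the analysis, the prediction rule can be sketched in Python. This is a minimal illustration under stated assumptions, not an efficient implementation: the function names are hypothetical, sgn(0) is taken to be −1, and the linear system $A_t \hat{w}_t = \sum_s y_s x_s \mathbb{I}_{\{\hat{y}_s \neq y_s\}}$ is solved from scratch in every round by Gaussian elimination rather than by incremental updates of the inverse.

```python
def solve_linear_system(a_matrix, b):
    """Solve a_matrix @ z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(a_matrix, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z


def second_order_perceptron(examples, a=1.0):
    """Conservative second-order Perceptron; returns the number of mistakes."""
    d = len(examples[0][0])
    # s_matrix holds a*I plus x_s x_s^T summed over past mistaken rounds.
    s_matrix = [[a if i == j else 0.0 for j in range(d)] for i in range(d)]
    v = [0.0] * d                               # sum of y_s x_s over past mistakes
    mistakes = 0
    for x, y in examples:
        # A_t additionally includes the current instance x_t.
        a_t = [[s_matrix[i][j] + x[i] * x[j] for j in range(d)] for i in range(d)]
        w_hat = solve_linear_system(a_t, v)
        p_hat = sum(wi * xi for wi, xi in zip(w_hat, x))
        y_hat = 1 if p_hat > 0 else -1          # assumption: sgn(0) = -1
        if y_hat != y:
            mistakes += 1
            s_matrix = a_t                      # keep x_t x_t^T only after a mistake
            v = [vi + y * xi for vi, xi in zip(v, x)]
    return mistakes
```

Since a > 0, every matrix `a_t` is positive definite, so the elimination never meets a zero pivot.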


Theorem 12.3. If the second-order Perceptron (the conservative Vovk-Azoury-Warmuth
forecaster) is run on a sequence $(x_1, y_1), (x_2, y_2), \ldots \in \mathbb{R}^d \times \{-1, 1\}$, then for all $n \ge 1$,
for all $u \in \mathbb{R}^d$, and for all $\gamma > 0$,

$$\sum_{t=1}^{n} \mathbb{I}_{\{\hat{y}_t \neq y_t\}} \le \frac{L_{\gamma,n}(u)}{\gamma} + \frac{1}{\gamma} \sqrt{ \bigl( a \|u\|^2 + u^\top A_n u \bigr) \sum_{i=1}^{d} \ln \Bigl( 1 + \frac{\lambda_i}{a} \Bigr) },$$

where $\lambda_1, \ldots, \lambda_d$ are the eigenvalues of the matrix

$$A_n = \sum_{t=1}^{n} x_t x_t^\top \mathbb{I}_{\{\hat{y}_t \neq y_t\}}.$$


The bound stated by Theorem 12.3 is not in closed form as the terms $\mathbb{I}_{\{\hat{y}_t \neq y_t\}}$ appear on both
sides. We could obtain a closed form by replacing $A_n$ with the matrix $x_1 x_1^\top + \cdots + x_n x_n^\top$
including all instances. Note that this substitution can only increase the eigenvalues
$\lambda_1, \ldots, \lambda_d$. However, the closed-form bound is probably too weak to reveal the improvements
brought about by the second-order Perceptron analysis with respect to the classical
Perceptron (see the end of this subsection for a detailed comparison of the two bounds in a
special case).


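Proof of Theorem 12.3. We follow the notation of Section 11.8 with the necessary adjustments
because we have introduced the new parameter a and we are considering conservative
forecasters. So, in particular, define

$$\Phi_t^*(u) = \frac{a}{2} \|u\|^2 + \frac{1}{2} \sum_{s=1}^{t-1} (u^\top x_s - y_s)^2 \mathbb{I}_{\{\hat{y}_s \neq y_s\}}.$$

This potential is connected to $\hat{w}_t$ by the following relation (see the proof of Theorem 11.8):

$$\frac{1}{2} (\hat{w}_t^\top x_t - y_t)^2 \mathbb{I}_{\{\hat{y}_t \neq y_t\}} = \inf_v \Phi_{t+1}^*(v) - \inf_v \Phi_t^*(v) + \frac{1}{2} x_t^\top A_t^{-1} x_t \mathbb{I}_{\{\hat{y}_t \neq y_t\}} - \frac{1}{2} \bigl( x_t^\top A_{t-1}^{-1} x_t \bigr) (\hat{w}_t^\top x_t)^2 \mathbb{I}_{\{\hat{y}_t \neq y_t\}},$$

where, if $\hat{y}_t = y_t$, the equality holds because $\inf_v \Phi_{t+1}^*(v) = \inf_v \Phi_t^*(v)$. We drop the last
term, which is negative because $A_{t-1}$ is positive definite, and sum over t = 1, ..., n,
obtaining, for any $u \in \mathbb{R}^d$,

$$\frac{1}{2} \sum_{t=1}^{n} (\hat{w}_t^\top x_t - y_t)^2 \mathbb{I}_{\{\hat{y}_t \neq y_t\}}
\le \inf_v \Phi_{n+1}^*(v) - \inf_v \Phi_1^*(v) + \frac{1}{2} \sum_{t=1}^{n} x_t^\top A_t^{-1} x_t \mathbb{I}_{\{\hat{y}_t \neq y_t\}}$$

$$\le \Phi_{n+1}^*(u) + \frac{1}{2} \sum_{t=1}^{n} x_t^\top A_t^{-1} x_t \mathbb{I}_{\{\hat{y}_t \neq y_t\}}$$

$$= \frac{a}{2} \|u\|^2 + \frac{1}{2} \sum_{t=1}^{n} (u^\top x_t - y_t)^2 \mathbb{I}_{\{\hat{y}_t \neq y_t\}} + \frac{1}{2} \sum_{t=1}^{n} x_t^\top A_t^{-1} x_t \mathbb{I}_{\{\hat{y}_t \neq y_t\}},$$

where we used $\inf_v \Phi_1^*(v) = 0$. Expanding the squares and performing trivial simplifications,
we get to the following inequality:

$$\frac{1}{2} \sum_{t=1}^{n} \bigl( (\hat{w}_t^\top x_t)^2 - 2 y_t \hat{w}_t^\top x_t \bigr) \mathbb{I}_{\{\hat{y}_t \neq y_t\}}
\le \frac{1}{2} \Bigl[ a \|u\|^2 + \sum_{t=1}^{n} (u^\top x_t)^2 \mathbb{I}_{\{\hat{y}_t \neq y_t\}} \Bigr] - \sum_{t=1}^{n} y_t u^\top x_t \mathbb{I}_{\{\hat{y}_t \neq y_t\}} + \frac{1}{2} \sum_{t=1}^{n} x_t^\top A_t^{-1} x_t \mathbb{I}_{\{\hat{y}_t \neq y_t\}}.$$

Note that the left-hand side of this inequality is a sum of nonnegative terms, because
$-2 y_t \hat{w}_t^\top x_t \ge 0$ whenever $\hat{y}_t \neq y_t$. In addition, we can write

$$\frac{1}{2} \Bigl[ a \|u\|^2 + \sum_{t=1}^{n} (u^\top x_t)^2 \mathbb{I}_{\{\hat{y}_t \neq y_t\}} \Bigr] = \frac{1}{2} u^\top \Bigl( a I + \sum_{t=1}^{n} x_t x_t^\top \mathbb{I}_{\{\hat{y}_t \neq y_t\}} \Bigr) u = \frac{1}{2} u^\top (a I + A_n) u$$

and, using Lemma 11.11,

$$\frac{1}{2} \sum_{t=1}^{n} x_t^\top A_t^{-1} x_t \mathbb{I}_{\{\hat{y}_t \neq y_t\}} \le \frac{1}{2} \sum_{i=1}^{d} \ln \Bigl( 1 + \frac{\lambda_i}{a} \Bigr).$$

This allows us to write the simpler form

$$0 \le \frac{1}{2} u^\top (a I + A_n) u - \sum_{t=1}^{n} y_t u^\top x_t \mathbb{I}_{\{\hat{y}_t \neq y_t\}} + \frac{1}{2} \sum_{i=1}^{d} \ln \Bigl( 1 + \frac{\lambda_i}{a} \Bigr).$$

Since u was chosen arbitrarily, this inequality also holds when u is replaced by $\alpha u$, where
$\alpha > 0$ is a free parameter. Performing this substitution, we end up with

$$0 \le \frac{\alpha^2}{2} u^\top (a I + A_n) u - \alpha \sum_{t=1}^{n} y_t u^\top x_t \mathbb{I}_{\{\hat{y}_t \neq y_t\}} + \frac{1}{2} \sum_{i=1}^{d} \ln \Bigl( 1 + \frac{\lambda_i}{a} \Bigr).$$

To introduce hinge loss terms, observe that $-y_t u^\top x_t \le \ell_{\gamma,t}(u) - \gamma$ for all $\gamma > 0$.
Substituting this into the above inequality, rearranging, and dividing both sides by $\alpha\gamma > 0$
yields

$$\sum_{t=1}^{n} \mathbb{I}_{\{\hat{y}_t \neq y_t\}} \le \frac{L_{\gamma,n}(u)}{\gamma} + \frac{\alpha}{2\gamma} u^\top (a I + A_n) u + \frac{1}{2\alpha\gamma} \sum_{i=1}^{d} \ln \Bigl( 1 + \frac{\lambda_i}{a} \Bigr).$$

Substituting the choice

$$\alpha = \sqrt{ \frac{\sum_{i=1}^{d} \ln(1 + \lambda_i / a)}{u^\top (a I + A_n) u} }$$

implies the claimed bound. ∎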


We now compare the bound of Theorem 12.3 with the corresponding bound for the 
Perceptron algorithm (Theorem 12.1 with p = 2). 


Consider the simple case of a sequence $(x_1, y_1), \ldots, (x_n, y_n)$ where $\|x_t\| = 1$ for all t
and such that there exists some $u \in \mathbb{R}^d$, with $\|u\| = 1$, satisfying $y_t u^\top x_t \ge \gamma > 0$ for all
t = 1, ..., n. For this linearly separable sequence, the bound of Theorem 12.3 may then be
written as

$$\sum_{t=1}^{n} \mathbb{I}_{\{\hat{y}_t \neq y_t\}} \le \frac{1}{\gamma} \sqrt{ \bigl( a + u^\top A_n u \bigr) \sum_{i=1}^{d} \ln \Bigl( 1 + \frac{\lambda_i}{a} \Bigr) }. \qquad (12.2)$$

Recall that this bound is not in closed form as $A_n$ (and its eigenvalues) depend on the
mistake terms $\mathbb{I}_{\{\hat{y}_t \neq y_t\}}$. Then let m be the largest cardinality of a subset $M \subseteq \{1, \ldots, n\}$
such that (12.2) is still satisfied when a mistake is made on each $t \in M$. Moreover, let
$m' = 1/\gamma^2$, where $1/\gamma^2$ is the Perceptron bound specialized to the case $\|u\| = 1$ and
$\|x_t\| = 1$ for all t. We want to investigate conditions on the sequence guaranteeing that
m < m'. To do that, we represent m' as the unique positive solution of

$$m' = \frac{1}{\gamma} \sqrt{m'}.$$

Thus m < m' whenever

$$\bigl( a + u^\top A_n u \bigr) \sum_{i=1}^{d} \ln \Bigl( 1 + \frac{\lambda_i}{a} \Bigr) < m. \qquad (12.3)$$

Now note that, since $\|u\| = 1$ and $\|x_t\| = 1$,

$$u^\top A_n u = \sum_{t=1}^{n} (u^\top x_t)^2 \mathbb{I}_{\{\hat{y}_t \neq y_t\}} \le m.$$

Using $(u^\top x_t)^2 = (y_t u^\top x_t)^2 \ge \gamma^2$, we get $\gamma^2 m \le u^\top A_n u \le m$. Hence,
we may write $u^\top A_n u = \alpha m$ for some $\gamma^2 \le \alpha \le 1$. Since $\|x_t\| = 1$ also implies
$\lambda_1 + \cdots + \lambda_d = m$, we may set $\lambda_i = \alpha_i m$, where the coefficients $\alpha_1, \ldots, \alpha_d \ge 0$ are such that
$\alpha_1 + \cdots + \alpha_d = 1$. Performing these substitutions in (12.3) we obtain

$$(a + \alpha m) \sum_{i=1}^{d} \ln \Bigl( 1 + \frac{\alpha_i m}{a} \Bigr) < m. \qquad (12.4)$$

If $\alpha m = u^\top A_n u$ is small compared with the large eigenvalues of $A_n$, then there exist choices
of a that satisfy (12.4) (see Exercise 12.5).

This discussion suggests that the linearly separable sequences on which the second-order
Perceptron has an advantage over the classical Perceptron are those where linear separators
u tend to be nearly orthogonal to the eigenvectors of $A_n$ with large eigenvalues. In such
sequences, a large share of instances $x_t$ must thus have the property that $y_t u^\top x_t$ is close to
the minimum value $\gamma$.

As a final remark note that the bounds of Theorems 12.1 and 12.3 are invariant to
simultaneous rescalings of $\gamma$ and $\|u\|$ that do not change the ratio $\|u\|/\gamma$ (in Theorem 12.1
this ratio should take the form $\|u\|_q/\gamma$). This is what we expect, because the loss $L_{\gamma,n}(u)/\gamma$
exhibits the same kind of invariance. Hence, we do not lose any generality if these results
are stated with $\gamma$ set to 1.


12.3 Maximum Margin Classifiers 


In this section we study the scenario in which a forecaster is repeatedly run on the
same sequence of examples. More specifically, we say that a forecaster is cyclically run
on a “base sequence” $(x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^d \times \{-1, 1\}$ if it is run on the sequence
$(x'_1, y'_1), (x'_2, y'_2), \dots$, where $(x'_{t+kn}, y'_{t+kn}) = (x_t, y_t)$ for all $k \ge 0$ and $t = 1, \dots, n$.

If the base sequence is linearly separable, then the mistake bound for a conservative
forecaster tells us how many updates are performed at most before the forecaster’s current
classifier converges to a linear separator of the base sequence. For example, the Perceptron
convergence theorem (see Section 12.2) states that at most $(\max_t \|x_t\| / \gamma)^2$ updates are
needed to find a linear separator for any sequence linearly separable with margin $\gamma > 0$.
However, the results of Section 12.2 do not provide information on the margin of the
separator found by the forecaster.

The question addressed here is whether we can modify the forecasters of Section 12.2
so that the classifier obtained after the last update has a margin close to the largest margin
achievable by any linear separator of the sequence.

Assume that the sequence $(x_1, y_1), (x_2, y_2), \dots \in \mathbb{R}^d \times \{-1, 1\}$ is linearly separable by
some $u \in \mathbb{R}^d$ and that $\|x_t\| = 1$ for all $t$. We now show that the following algorithm, a simple
modification of the Perceptron, when cyclically run on a sequence with margin $\gamma$, finds a
linear separator with margin $(1 - \alpha)\gamma$ after a number of updates of order $1/(\alpha\gamma)^2$, where $\alpha$ is an input
parameter. Following the terminology of Gentile [123], we call this modified Perceptron
ALMA (approximate large margin algorithm).


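THE ALMA FORECASTER
Parameter: $\alpha \in (0, 1]$.
Initialization: $w_0 = (0, \dots, 0)$, $k = 1$.
For each round $t = 1, 2, \dots$
(1) set $\gamma_t = \sqrt{8/k}\,/\alpha$;
(2) observe $x_t$, set $\hat p_t = w_{t-1} \cdot x_t$, and predict with $\hat y_t = \mathrm{sgn}(\hat p_t)$;
(3) get label $y_t \in \{-1, 1\}$;
(4) if $y_t\, w_{t-1} \cdot x_t \le (1 - \alpha)\gamma_t$, then
    (4.1) $\eta_t = \sqrt{2/k}$ and $w'_t = w_{t-1} + \eta_t\, y_t\, x_t$;
    (4.2) $w_t = w'_t / \|w'_t\|$;
    (4.3) $k \leftarrow k + 1$;
(5) else, let $w_t = w_{t-1}$.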


Theorem 12.4. Suppose the ALMA forecaster is cyclically run on a sequence
$(x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^d \times \{-1, 1\}$ with $\|x_t\| = 1$ for all $t$, linearly separable by $u \in \mathbb{R}^d$
with margin $\gamma > 0$. Let $m$ be the number of updates performed by ALMA on this sequence.
Then $m$ (and hence also the number of mistakes) is finite and satisfies

$$m \le \frac{2}{\gamma^2} \Bigl(\frac{2}{\alpha} - 1\Bigr)^2 + \frac{8}{\alpha} - 4.$$

Furthermore, let $s$ be the time step when the last update occurs. Then the weight $w_s$ computed
by ALMA at time $s$ is a linear separator of the sequence achieving margin $(1 - \alpha)\gamma$.
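
The forecaster and the statement of Theorem 12.4 can be checked numerically on a toy sequence. The following is a minimal sketch, not a reference implementation: the function name `alma_cyclic`, the `max_passes` safeguard, and the plain-list representation of vectors are illustrative assumptions that do not appear in the text. The loop stops after the first full pass over the base sequence in which no update occurs.

```python
import math


def alma_cyclic(examples, alpha, max_passes=10_000):
    """Cyclically run ALMA on `examples`, a list of (x, y) pairs with
    ||x|| = 1 and y in {-1, +1}. Returns (w, number_of_updates).

    `max_passes` is only a safeguard for non-separable input (assumption).
    """
    d = len(examples[0][0])
    w = [0.0] * d          # w_0 = (0, ..., 0)
    k = 1                  # update counter, as in the ALMA box
    updates = 0
    for _ in range(max_passes):
        updated_this_pass = False
        for x, y in examples:
            gamma_t = math.sqrt(8.0 / k) / alpha
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            if margin <= (1.0 - alpha) * gamma_t:
                eta_t = math.sqrt(2.0 / k)
                w_prime = [wi + eta_t * y * xi for wi, xi in zip(w, x)]
                norm = math.sqrt(sum(v * v for v in w_prime))
                w = [v / norm for v in w_prime]   # normalization step
                k += 1
                updates += 1
                updated_this_pass = True
        if not updated_this_pass:
            return w, updates
    raise RuntimeError("no convergence within max_passes; is the data separable?")


def update_bound(gamma, alpha):
    """Right-hand side of the bound in Theorem 12.4."""
    return 2.0 / gamma**2 * (2.0 / alpha - 1.0) ** 2 + 8.0 / alpha - 4.0
```

For instance, four unit vectors in the plane separated by $u = (1, 0)$ with margin $\gamma = 1/2$ can be passed to `alma_cyclic` with `alpha=0.5`; the number of updates can then be compared with `update_bound(0.5, 0.5)` and the margin of the returned vector with $(1 - \alpha)\gamma$.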


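Proof. For any $t = 1, 2, \dots$, let $N_t = \|w'_t\|$ and $\gamma_t(u) = y_t\, u \cdot x_t$. We first find an upper
bound on $m$ by studying the quantity $u \cdot w_t$. Choose any round $t$ such that $y_t\, w_{t-1} \cdot x_t \le
(1 - \alpha)\gamma_t$. Then

$$u \cdot w_t = \frac{u \cdot w_{t-1} + \eta_t\, y_t\, u \cdot x_t}{N_t} \ge \frac{u \cdot w_{t-1} + \eta_t \gamma}{N_t}$$

and, since $\|w_{t-1}\| \le 1$ and $\|x_t\| = 1$,

$$N_t^2 = \|w'_t\|^2 = \|w_{t-1} + \eta_t\, y_t\, x_t\|^2 \le 1 + \eta_t^2 + 2\eta_t\, y_t\, w_{t-1} \cdot x_t \le 1 + \eta_t^2 + 2\eta_t (1 - \alpha)\gamma_t.$$

The last inequality holds because an update at time $t$ implies that $y_t\, w_{t-1} \cdot x_t \le (1 - \alpha)\gamma_t$.
Substituting the values of $\eta_t$ and $\gamma_t$ in the last expression, we obtain $N_t^2 \le 1 + 2A/k_t$,
where $A = 4/\alpha - 3$ and $k_t$ is the number of updates performed after the first $t$ time steps.
Now we bound $m$ by analyzing $u \cdot w_s$ through the recursion

$$u \cdot w_s \ge \frac{u \cdot w_{s-1} + \eta_s \gamma}{\sqrt{1 + 2A/m}} = \frac{u \cdot w_{s-1}}{\sqrt{1 + 2A/m}} + \frac{\gamma}{\sqrt{m/2 + A}}.$$

Solving this recursion, while keeping in mind that $w_0 = 0$ and $u \cdot w_t = u \cdot w_{t-1}$ if no update
takes place at time $t$, we obtain

$$u \cdot w_s \ge \sum_{k=1}^{m} \frac{\gamma}{\sqrt{k/2 + A}} \prod_{j=k+1}^{m} \frac{1}{\sqrt{1 + 2A/j}},$$

where for $k = m$ the product has value 1. Now,

$$-\ln \prod_{j=k+1}^{m} \frac{1}{\sqrt{1 + 2A/j}} = \frac{1}{2} \sum_{j=k+1}^{m} \ln\Bigl(1 + \frac{2A}{j}\Bigr) \le A \sum_{j=k+1}^{m} \frac{1}{j} \le A \int_{k}^{m} \frac{dx}{x} = A \ln \frac{m}{k},$$

where the first inequality holds since $\ln(1 + x) \le x$ for all $x > -1$. Therefore,

$$u \cdot w_s \ge \gamma \sum_{k=1}^{m} \frac{(k/m)^A}{\sqrt{k/2 + A}}.$$

Now, since $u \cdot w_s \le 1$, we obtain

$$1 \ge \gamma \sum_{k=1}^{m} \frac{(k/m)^A}{\sqrt{k/2 + A}} \ge \gamma \sum_{k=1}^{m} \frac{(k/m)^A}{\sqrt{m/2 + A}} \ge \frac{\gamma}{m^A \sqrt{m/2 + A}} \int_{0}^{m} k^A\, dk = \frac{\gamma\, m}{(A + 1)\sqrt{m/2 + A}}.$$

Solving for $m$ yields

$$\begin{aligned}
m &\le \frac{(A+1)^2}{4\gamma^2} + \sqrt{\frac{(A+1)^4}{16\gamma^4} + \frac{(A+1)^2 A}{\gamma^2}} \\
  &\le \frac{(A+1)^2}{4\gamma^2} + \frac{(A+1)^{3/2}}{\gamma} \sqrt{\frac{A+1}{16\gamma^2} + 1} \\
  &\le \frac{(A+1)^2}{4\gamma^2} + \frac{(A+1)^{3/2}}{\gamma} \Bigl(\frac{\sqrt{A+1}}{4\gamma} + \frac{2\gamma}{\sqrt{A+1}}\Bigr) \\
  &= \frac{(A+1)^2}{2\gamma^2} + 2(A+1),
\end{aligned}$$

where we used the inequality $\sqrt{x + 1} \le \sqrt{x} + 1/(2\sqrt{x})$ for $x > 0$ in the last inequality. Substituting
our choice of $A$ yields the desired result.

To show that $w_s$ is a linear separator with margin $(1 - \alpha)\gamma$, note that

$$\begin{aligned}
\gamma_s = \frac{1}{\alpha}\sqrt{\frac{8}{m}}
  &\ge \frac{1}{\alpha} \sqrt{\frac{8}{\dfrac{(A+1)^2}{2\gamma^2} + 2(A+1)}} \\
  &= \frac{4\gamma}{\alpha \sqrt{(A+1)^2 + 4\gamma^2 (A+1)}} \\
  &\ge \frac{4\gamma}{\alpha \sqrt{(A+1)^2 + 4(A+1)}} \qquad \text{(since } 0 < \gamma \le 1\text{)} \\
  &= \frac{\gamma}{\sqrt{1 - \alpha^2/4}} \\
  &\ge \gamma.
\end{aligned}$$

Since the last update occurs at time $s$, this means that $y_t\, w_s \cdot x_t > (1 - \alpha)\gamma$
for all $t > s$. □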


12.4 Label Efficient Classifiers 


In Section 6.2 we looked at prediction in a “label efficient” scenario where the forecaster 
has limited access to the sequence of outcomes y1, y2, .... Using an independent random 
process for selecting the outcomes to observe, we have been able to control the regret of 
the weighted average forecasters when an a priori bound is imposed on the overall number 
of outcomes that may be observed. 

In this section we cast label efficient prediction in the model of linear classification with
side information: after generating the prediction $\hat y_t = \mathrm{sgn}(\hat p_t)$ for the next label $y_t$ given
the side information $x_t$, the forecaster uses randomization to decide whether to query $y_t$
or not. If $y_t$ is not queried, its value remains unknown to the forecaster, and the current
classifier is not updated. It is important to remark that, as in Section 6.2, the forecaster in this model
is evaluated by counting prediction mistakes on all time steps, including those when the
true labels $y_t$ remained unknown.

We study selective sampling algorithms that use a simple randomized rule to decide
whether to query the label of the current instance. This rule prescribes that the label should
be obtained with probability $c/(c + |\hat p_t|)$, where $\hat p_t$ is the margin achieved by the current
linear classifier on the instance, and $c > 0$ is a parameter of the algorithm acting as a scaling
factor on $\hat p_t$. Note that a label is sampled with a small probability whenever the margin is
large.

Unlike the approach described in Section 6.2, this rule provides no control on the number
of queried labels. In fact, this number is a random variable depending, through the margin
$\hat p_t$, on the interaction between the algorithm and the data sequence on which the algorithm
is run. Owing to the complex nature of this interaction, the analysis fails to characterize the
behavior of this random variable in terms of simple quantities related to the data sequence.
However, the analysis does reveal an interesting phenomenon. In all of the label efficient
algorithms we analyzed, a proper choice of the scaling factor $c$ in the randomized rule yields
the same mistake bound as that achieved by the original forecaster before the introduction
of the label efficient mechanism. Hence, in some sense, the randomization uses the margin
information to select those labels that can be ignored without increasing (in expectation)
the overall number of mistakes.

To provide some intuition on how the randomized selection rule works, consider the
standard Perceptron algorithm run on a sequence $(x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^d \times \{-1, 1\}$,
where we assume $\|x_t\| = 1$ for $t = 1, \dots, n$. For the sake of simplicity, assume that this
sequence is linearly separated by a hyperplane $u \in \mathbb{R}^d$. Recall that $\hat p_t = w_{t-1} \cdot x_t$. A basic
inequality controlling the hinge loss in this case (see Section 12.2, and also the proof of
Theorem 12.5 below) is

$$(1 - y_t \hat p_t)_+ \le \frac{1}{2} \bigl( \|u - w_{t-1}\|^2 - \|u - w_t\|^2 + 1 \bigr).$$

Now, if $y_t \hat p_t < 0$ and $|\hat p_t|$ is large, then the hinge loss $(1 - y_t \hat p_t)_+$ is also large. This
in turn implies that the difference $\|u - w_{t-1}\|^2 - \|u - w_t\|^2$ must be big. This means
that $\|u - w_t\|^2$ drops as $w_{t-1}$ is updated to $w_t$. So, whenever the Perceptron makes
a classification mistake with a large margin value $|\hat p_t|$, the weight $w_{t-1}$ is moved by
a significant amount toward the linear separator $u$. In this respect, mistakes with large
margin bring more progress than mistakes with margin close to 0. On the other hand,
the standard Perceptron algorithm does not take into account the information brought
by $|\hat p_t|$. The basic idea underlying the label efficient method is to incorporate this
information into the prediction by using the size of $|\hat p_t|$ to trade off potential progress against a
spared label.

We now formally define and analyze a label efficient version of the Perceptron algorithm. 
Similar arguments can be developed to prove analogous bounds for other conservative 
gradient-based forecasters analyzed in this chapter (see the exercises). 

As our forecasters are randomized, we adopt the terminology introduced in Chapter 4.
The forecaster has access to a sequence $U_1, U_2, \dots$ of i.i.d. random variables uniformly
distributed in $[0, 1]$. The decision of querying the outcome at time $t$ is defined by the value
of a Bernoulli random variable $Z_t$ of parameter $q_t$ (where $q_t$ is determined by $U_1, \dots, U_{t-1}$
and by the specific selection rule used by the forecaster). To obtain a realization of $Z_t$,
the forecaster assigns $Z_t = 1$ if and only if $U_t \in [0, q_t)$. The sequence of outcomes is
represented by the random variables $Y_1, Y_2, \dots$, where each $Y_t$ is measurable with respect
to the $\sigma$-algebra generated by $U_1, \dots, U_{t-1}$. This implies that $Y_t$ is determined before the
value of $Z_t$ is drawn. Our results hold also when instances $x_t$ are measurable functions of
$U_1, \dots, U_{t-1}$. However, to keep the notation simple, we derive our results in the special
case of arbitrary and fixed instance sequences.


THE LABEL EFFICIENT PERCEPTRON ALGORITHM
Parameter: $c > 0$.
Initialization: $w_0 = (0, \dots, 0)$.
For each round $t = 1, 2, \dots$
(1) observe $x_t$, set $\hat p_t = w_{t-1} \cdot x_t$, and predict with $\hat y_t = \mathrm{sgn}(\hat p_t)$;
(2) draw a Bernoulli random variable $Z_t \in \{0, 1\}$ of parameter $c/(c + |\hat p_t|)$;
(3) if $Z_t = 1$, then query label $Y_t \in \{-1, 1\}$, and let $w_t = w_{t-1} + Y_t\, x_t\, I_{\{\hat y_t \ne Y_t\}}$;
(4) if $Z_t = 0$, then $w_t = w_{t-1}$.
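
The four steps above can be sketched in a few lines. This is an illustrative sketch only: the function name, the returned counters, the optional `rng` argument, and the convention $\mathrm{sgn}(0) = +1$ are assumptions made here, not part of the algorithm as stated. Mistakes are counted on every round, whether or not the label is queried, in line with how the forecaster is evaluated.

```python
import random


def label_efficient_perceptron(examples, c, rng=None):
    """Run the label efficient Perceptron on `examples`, a list of (x, y)
    pairs with y in {-1, +1}. Returns (w, mistakes, queries).

    Assumption: sgn(0) is taken to be +1.
    """
    rng = rng if rng is not None else random.Random(0)
    d = len(examples[0][0])
    w = [0.0] * d
    mistakes = 0
    queries = 0
    for x, y in examples:
        p_hat = sum(wi * xi for wi, xi in zip(w, x))
        y_hat = 1 if p_hat >= 0 else -1
        if y_hat != y:
            mistakes += 1                      # counted even if y is never seen
        # Z_t = 1 with probability c / (c + |p_hat|), via U_t uniform in [0, 1)
        if rng.random() < c / (c + abs(p_hat)):
            queries += 1
            if y_hat != y:                     # conservative update on mistakes only
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w, mistakes, queries
```

Note that on the first round $\hat p_1 = 0$, so the query probability is exactly 1 and the first label is always requested.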


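Theorem 12.5. If the label efficient Perceptron algorithm is run on a sequence
$(x_1, Y_1), (x_2, Y_2), \dots \in \mathbb{R}^d \times \{-1, 1\}$, then for all $n \ge 1$, for all $u \in \mathbb{R}^d$, and for all $\gamma > 0$,
the expected number of mistakes satisfies

$$\mathbb{E}\Biggl[\sum_{t=1}^{n} I_{\{\hat y_t \ne Y_t\}}\Biggr] \le \frac{L_{\gamma,n}(u)}{\gamma} + \frac{X^2}{2c}\, \frac{L_{\gamma,n}(u)}{\gamma} + \frac{\|u\|^2 (2c + X^2)^2}{8 c \gamma^2},$$

where $X = \max_{t=1,\dots,n} \|x_t\|$.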


Note that by choosing

$$c = \frac{X^2}{2} \sqrt{1 + \frac{4\gamma^2}{\|u\|^2 X^2}\, \frac{L_{\gamma,n}(u)}{\gamma}}$$

one recovers (in expectation) the bound shown by Theorem 12.1 (in the special case
$p = 2$). However, as $c$ is an input parameter of the algorithm, this setting implies that, at the
beginning of the prediction process, the algorithm needs some information on the sequence
of examples. In addition, unlike the bound of Theorem 12.1 that holds simultaneously for
all $\gamma$ and $u$, this refined bound can only be obtained for fixed choices of these quantities.
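
To see how a tuning of this kind arises, one can minimize the right-hand side of Theorem 12.5 over $c$. The short calculation below is a sketch under the assumption that $L = L_{\gamma,n}(u)$, $\|u\| > 0$, $\gamma$, and $X$ are all known in advance, which is exactly the limitation just noted; $B(c)$ and $c^\ast$ are names introduced here for illustration.

```latex
\begin{align*}
B(c) &= \frac{L}{\gamma} + \frac{X^2 L}{2c\gamma} + \frac{\|u\|^2 (2c + X^2)^2}{8c\gamma^2} \\
     &= \frac{L}{\gamma} + \frac{\|u\|^2 X^2}{2\gamma^2}
        + \frac{\|u\|^2}{2\gamma^2}\, c
        + \frac{1}{c} \left( \frac{X^2 L}{2\gamma} + \frac{\|u\|^2 X^4}{8\gamma^2} \right), \\[4pt]
B'(c^\ast) = 0
  &\iff (c^\ast)^2 = \frac{X^4}{4} + \frac{X^2 \gamma L}{\|u\|^2}
   \iff c^\ast = \frac{X^2}{2} \sqrt{1 + \frac{4\gamma L}{\|u\|^2 X^2}}, \\[4pt]
B(c^\ast) &= \frac{L}{\gamma} + \frac{(X \|u\|)^2}{2\gamma^2}
        + \frac{X \|u\|}{\gamma} \sqrt{\frac{L}{\gamma} + \frac{(X \|u\|)^2}{4\gamma^2}}.
\end{align*}
```

The last line uses the fact that $ac + b/c$ with $a, b > 0$ is minimized at $c = \sqrt{b/a}$, where it equals $2\sqrt{ab}$.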


Proof of Theorem 12.5. Introduce the Bernoulli random variable $M_t = I_{\{\hat y_t \ne Y_t\}}$. We start
from the chain of inequalities

$$\begin{aligned}
\gamma - \ell_{\gamma,t}(u) &\le u \cdot \bigl(-\nabla \ell_{\gamma,t}(w_{t-1})\bigr) \\
  &\le -(u - w_{t-1}) \cdot \nabla \ell_{\gamma,t}(w_{t-1}) \\
  &= D_{\Phi^*}(u, w_{t-1}) - D_{\Phi^*}(u, w_t) + D_{\Phi^*}(w_{t-1}, w_t),
\end{aligned}$$

which we used for the derivation of (12.1) in Section 12.2. This holds for any conservative
gradient-based forecaster on any time step $t$ such that $M_t = 1$.

Just like the Perceptron, the label efficient Perceptron uses the quadratic potential $\Phi(v) =
\Phi^*(v) = \frac{1}{2}\|v\|^2$, and thus $D_{\Phi^*}(u, v) = \frac{1}{2}\|u - v\|^2$. Consider now a time step $t$ where the
label efficient Perceptron queries a label and makes a mistake. Then $Z_t = 1$, $M_t = 1$,
and $-\nabla \ell_{\gamma,t}(w_{t-1}) = Y_t x_t$. Hence, we may rewrite this chain of inequalities as follows:

$$\begin{aligned}
\gamma - \ell_{\gamma,t}(u) &\le Y_t\, u \cdot x_t \\
  &= Y_t (u - w_{t-1} + w_{t-1}) \cdot x_t \\
  &= Y_t\, w_{t-1} \cdot x_t + \frac{1}{2}\|u - w_{t-1}\|^2 - \frac{1}{2}\|u - w_t\|^2 + \frac{1}{2}\|w_{t-1} - w_t\|^2.
\end{aligned}$$

Note that this time we obtained a stronger inequality by adding and subtracting the negative
term $Y_t\, w_{t-1} \cdot x_t = Y_t \hat p_t$. The additional term provided by this more careful analysis is the
key to obtaining the final result.

Using $Y_t \hat p_t \le 0$ and replacing $u$ with $\alpha u$ for $\alpha > 0$, we obtain the inequality

$$(\alpha\gamma + |\hat p_t|)\, M_t Z_t \le \alpha\, \ell_{\gamma,t}(u) + \frac{1}{2}\|\alpha u - w_{t-1}\|^2 - \frac{1}{2}\|\alpha u - w_t\|^2 + \frac{1}{2}\|w_{t-1} - w_t\|^2$$

that holds for all time steps $t$. Indeed, if $M_t Z_t = 0$ the inequality still holds because
$\alpha\, \ell_{\gamma,t}(u) \ge 0$ and $w_{t-1} = w_t$. Summing for $t = 1, \dots, n$, we get

$$\sum_{t=1}^{n} (\alpha\gamma + |\hat p_t|)\, M_t Z_t \le \alpha L_{\gamma,n}(u) + \frac{\alpha^2}{2}\|u\|^2 + \frac{1}{2} \sum_{t=1}^{n} \|w_{t-1} - w_t\|^2,$$

where $\alpha^2 \|u\|^2 = \|\alpha u - w_0\|^2$ and we dropped $-\|\alpha u - w_n\|^2 / 2$. Finally, since $M_t Z_t = 0$
implies $\|w_{t-1} - w_t\| = 0$, using $\|w_{t-1} - w_t\|^2 \le X^2$ we get

$$\sum_{t=1}^{n} \Bigl(\alpha\gamma + |\hat p_t| - \frac{X^2}{2}\Bigr) M_t Z_t \le \alpha L_{\gamma,n}(u) + \frac{\alpha^2}{2}\|u\|^2.$$

Now choose $\alpha = (c + X^2/2)/\gamma$ for some $c > 0$ to be determined. The above inequality
then becomes

$$\sum_{t=1}^{n} (c + |\hat p_t|)\, M_t Z_t \le \frac{c\, L_{\gamma,n}(u)}{\gamma} + \frac{X^2 L_{\gamma,n}(u)}{2\gamma} + \frac{\|u\|^2 (2c + X^2)^2}{8\gamma^2}.$$

We now take expectations on both sides. Note that, by definition of the algorithm, $\mathbb{E}_t Z_t =
c/(c + |\hat p_t|)$, where we use $\mathbb{E}_t$ to indicate conditional expectation given $U_1, \dots, U_{t-1}$. Also,
$M_t$ and $\hat p_t$ are measurable with respect to the $\sigma$-algebra generated by $U_1, \dots, U_{t-1}$. Thus
we get

$$\mathbb{E}\Biggl[\sum_{t=1}^{n} (c + |\hat p_t|)\, M_t Z_t\Biggr] = \mathbb{E}\Biggl[\sum_{t=1}^{n} (c + |\hat p_t|)\, M_t\, \mathbb{E}_t Z_t\Biggr] = \mathbb{E}\Biggl[\sum_{t=1}^{n} c\, M_t\Biggr].$$

Dividing both sides by $c$, we arrive at the claimed inequality

$$\mathbb{E}\Biggl[\sum_{t=1}^{n} M_t\Biggr] \le \frac{L_{\gamma,n}(u)}{\gamma} + \frac{X^2}{2c}\, \frac{L_{\gamma,n}(u)}{\gamma} + \frac{\|u\|^2 (2c + X^2)^2}{8 c \gamma^2}. \qquad □$$

Figure 12.3. A set of labeled instances $x_t \in \mathbb{R}$ is shown on the abscissa as empty circles (label $-1$)
and filled circles (label $+1$). This set is not linearly separable in $\mathbb{R}$. However, by mapping each $x_t \in \mathbb{R}$
via $\phi(x) = (x, 1 + x\sqrt{2} + x^2)$ we obtain a linearly separable set in $\mathbb{R}^2$. The coefficient $\sqrt{2}$ is chosen
so that inner products $\phi(x)\phi(x')$ between mapped instances can be computed using the polynomial
kernel function $K(x, x') = (1 + x x')^2$.

Kernel functions are an elegant way of turning a linear forecaster into a nonlinear one with
a reasonable computational cost. As a motivating example, consider the following simple
reduction from quadratic classifiers in $\mathbb{R}^2$ to linear classifiers in $\mathbb{R}^6$. In $\mathbb{R}^2$, a quadratic
classifier $f : \mathbb{R}^2 \to \{-1, 1\}$ is defined by

$$f(x_1, x_2) = \mathrm{sgn}\bigl(p(x_1, x_2)\bigr),$$

where $p(x_1, x_2) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + w_5 x_2^2$ is any second-degree
polynomial in the variables $x_1$ and $x_2$. The decision surface of $f$ is the set of points
$(x_1, x_2) \in \mathbb{R}^2$ satisfying the equation $p(x_1, x_2) = 0$. The decision surface of a linear
classifier is a hyperplane, whereas for quadratic classifiers the decision surface is a conic (the
family of curves to which ellipses, parabolas, and hyperbolas belong). To learn a quadratic
classifier with a linear forecaster, it is enough to observe that $p(x_1, x_2)$ of the above form
can be written as $w \cdot x'$ for $w = (w_0, w_1, \dots, w_5)$ and $x' = (1, x_1, x_2, x_1 x_2, x_1^2, x_2^2)$. Thus,
we can transform each instance $x_t = (x_{1,t}, x_{2,t})$ via the mapping

$$(x_{1,t}, x_{2,t}) \mapsto (1, x_{1,t}, x_{2,t}, x_{1,t} x_{2,t}, x_{1,t}^2, x_{2,t}^2) = x'_t$$

and then run the linear forecaster on the transformed instances $x'_t \in \mathbb{R}^6$ instead of the
original instances $x_t \in \mathbb{R}^2$ (see Figure 12.3 for a 1-dimensional illustration). The vector $x'$
is often called a feature vector, and in the example considered, $\mathbb{R}^6$ plays the role of the
feature space.

This simple trick can be easily generalized to learn any $k$th-degree polynomial decision
surface in $\mathbb{R}^d$. However, the computational cost of implementing the mapping $\phi$, even for $k$
moderately large, is too high. In fact, $\binom{d+k}{k}$ coefficients are needed to represent a $k$th-degree
polynomial surface in $\mathbb{R}^d$, implying that we have to run our linear forecaster on instances
of dimension exponentially large in $k$.
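
To get a feeling for how quickly the feature space grows, the number of monomials of degree at most $k$ in $d$ variables, which equals $\binom{d+k}{k}$, can be computed directly. The helper names below are illustrative assumptions; the brute-force enumerator is included only as a cross-check for small cases.

```python
from itertools import product
from math import comb


def num_coefficients(d, k):
    """Number of monomials of degree at most k in d variables."""
    return comb(d + k, k)


def num_coefficients_bruteforce(d, k):
    """Count exponent vectors (e_1, ..., e_d) with e_1 + ... + e_d <= k."""
    return sum(1 for e in product(range(k + 1), repeat=d) if sum(e) <= k)
```

With $d = 2$ and $k = 2$ this gives the six coefficients $w_0, \dots, w_5$ of the motivating example, while for, say, $d = 100$ the count is already in the tens of millions at $k = 5$.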

Computational problems nearly disappear if the classification of $x_t$, at an arbitrary time
$t$, can be computed using only inner products between instances. For example, the classifier
computed by the Perceptron algorithm at time $t$ can be written in the form

$$f(x) = \mathrm{sgn}\Bigl(\sum_{s} \alpha_s\, y_s\, x_s \cdot x\Bigr),$$

where $\alpha_s \in \mathbb{R}$ and the sum ranges over a subset of the instance sequence $x_1, \dots, x_{t-1}$.
Now suppose the Perceptron is run on the transformed instances $x'_t = \phi(x_t)$, where we have
rewritten the $\phi$ of our initial example as $\phi(x_1, x_2) = (1, x_1\sqrt{2}, x_2\sqrt{2}, x_1 x_2 \sqrt{2}, x_1^2, x_2^2)$. Note
that the introduction of the scaling coefficients $\sqrt{2}$ makes no difference for the learning
problem faced by the forecaster. Then, as $\phi(x_s) \cdot \phi(x) = (1 + x_s \cdot x)^2$, we can avoid the
computation of any $\phi(x_t)$. Indeed,

$$f(x) = \mathrm{sgn}\Bigl(\sum_{s} \alpha_s\, y_s\, \phi(x_s) \cdot \phi(x)\Bigr) = \mathrm{sgn}\Bigl(\sum_{s} \alpha_s\, y_s\, (1 + x_s \cdot x)^2\Bigr).$$

In general, if $\phi$ maps $x \in \mathbb{R}^d$ to $\phi(x) = x'$ whose components are all the monomials
of a $k$th-degree polynomial in the variables $x$ (with suitable scaling coefficients), then
$\phi(x_s) \cdot \phi(x) = (1 + x_s \cdot x)^k$. Hence, the forecaster can learn a polynomial classifier without
ever explicitly computing the coefficients of the polynomial curve.
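
The identity between the explicit feature map and the degree-2 polynomial kernel is easy to confirm numerically. The sketch below uses the scaled map $\phi$ of the two-dimensional example; the function names are illustrative assumptions.

```python
import math


def phi(x):
    """Scaled quadratic feature map on R^2, mapping into R^6."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, r2 * x1 * x2, x1 * x1, x2 * x2)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(u, v):
    """K(u, v) = (1 + u . v)^2, computed without building phi."""
    return (1.0 + dot(u, v)) ** 2
```

For any two points `u` and `v` in the plane, `dot(phi(u), phi(v))` and `poly2_kernel(u, v)` agree up to floating-point rounding, but the kernel needs only one inner product in $\mathbb{R}^2$.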

Note that saying that a forecaster manipulates the transformed instances $\phi(x)$ using
only inner products implies that the computation performed by the forecaster is invariant
to transformations that map the instance sequence $(x_1, x_2, \dots)$ to $(A x_1, A x_2, \dots)$,
where $A$ performs a change between two orthonormal bases. To see this, note that
$(Au)^\top (Av) = u^\top A^\top A v = u^\top v$. Such forecasters are sometimes called rotationally invariant.
Unfortunately, not all gradient-based forecasters for classification are rotationally
invariant. In particular, among the linear forecasters studied in this chapter, only the Perceptron
and the second-order Perceptron have this property (see Exercise 12.12).

In view of extending this approach to surfaces that go beyond polynomials, we investigate
the conditions guaranteeing that a symmetric function $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ has the property
$K(u, v) = \langle \phi(u), \phi(v) \rangle$ for all $u, v \in \mathbb{R}^d$ and for some $\phi$ mapping $\mathbb{R}^d$ to a Hilbert space (we
use $\langle \cdot, \cdot \rangle$ to denote the inner product in this space). We call kernel any such function $K$.
Note that we changed the range of $\phi$ from a finite-dimensional euclidean space to a Hilbert
space. In this space, our transformed instances $\phi(x)$ are vectors with possibly an infinite
number of components. This makes it possible, for example, to learn a certain class of infinite-degree
polynomial decision curves. For reasons that are made clear in the proof of the following
result, the Hilbert space $\mathcal{H}$ associated with a kernel function is called a reproducing kernel
Hilbert space.

It turns out that a simple characterization of kernels exists.

Theorem 12.6. A symmetric function $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a kernel if and only if for all
$n \in \mathbb{N}$ and for all $x_1, \dots, x_n \in \mathbb{R}^d$ the $n \times n$ matrix $K$ with elements $K(x_i, x_j)$ is positive
semidefinite.


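Proof. Assume first that $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is such that $K(u, v) = \langle \phi(u), \phi(v) \rangle$. Fix any
positive integer $n \in \mathbb{N}$, choose $x_1, \dots, x_n \in \mathbb{R}^d$ arbitrarily, and let $K$ be the associated
matrix. Then, for all $u \in \mathbb{R}^n$,

$$\begin{aligned}
u^\top K u &= \sum_{i,j=1}^{n} K(x_i, x_j)\, u_i u_j \\
  &= \sum_{i,j=1}^{n} \langle \phi(x_i), \phi(x_j) \rangle\, u_i u_j \\
  &= \Bigl\langle \sum_{i=1}^{n} \phi(x_i) u_i,\ \sum_{j=1}^{n} \phi(x_j) u_j \Bigr\rangle \\
  &= \Bigl\| \sum_{i=1}^{n} \phi(x_i) u_i \Bigr\|^2 \ \ge\ 0.
\end{aligned}$$

Hence $K$ is positive semidefinite.

Assume now $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is such that, for any choice of $x_1, \dots, x_n \in \mathbb{R}^d$, the
resulting kernel matrix is positive semidefinite. Introduce the linear space $V$ of functions
$f : \mathbb{R}^d \to \mathbb{R}$ defined by

$$f(\cdot) = \sum_{i=1}^{n} \alpha_i K(x_i, \cdot), \qquad \text{where } n \in \mathbb{N},\ \alpha_i \in \mathbb{R},\ x_i \in \mathbb{R}^d,\ i = 1, \dots, n,$$

where we set $a(f + g)(x) = a f(x) + a g(x)$ for any $a \in \mathbb{R}$ and $x \in \mathbb{R}^d$. We now make $V$
an inner product space. Introduce the operator $\langle \cdot, \cdot \rangle$ such that, for any two $f, g \in V$ defined
by

$$f(\cdot) = \sum_{i=1}^{m} \alpha_i K(u_i, \cdot) \qquad \text{and} \qquad g(\cdot) = \sum_{j=1}^{n} \beta_j K(v_j, \cdot),$$

we have

$$\langle f, g \rangle = \sum_{i=1}^{m} \sum_{j=1}^{n} \alpha_i \beta_j K(u_i, v_j).$$

Clearly, $\langle \cdot, \cdot \rangle$ defined in this way is real valued, symmetric, and bilinear. In addition, for
all $f \in V$,

$$\langle f, f \rangle = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j K(u_i, u_j) = \alpha^\top K \alpha \ge 0$$

because $K$ is positive semidefinite by assumption. Thus, to verify that $\langle \cdot, \cdot \rangle$ is indeed an
inner product on $V$, we just have to show that $\langle f, f \rangle = 0$ implies $f = 0$. To see this, first
note that

$$\langle f, K(x, \cdot) \rangle = \sum_{i=1}^{m} \alpha_i K(x, u_i) = f(x) \qquad \text{(reproducing property)},$$

where the first equality follows from the definition of $\langle \cdot, \cdot \rangle$ by taking $g(\cdot) = K(x, \cdot)$. Hence,
$f(x)^2 = \langle f, K(x, \cdot) \rangle^2 \le \langle f, f \rangle K(x, x)$ by the Cauchy–Schwarz inequality, and this yields
the desired implication. Thus $V$ endowed with $\langle \cdot, \cdot \rangle$ is an inner product space.

To make $V$ into a complete Hilbert space, we introduce in $V$ a norm defined by $\|f - g\| =
\sqrt{\langle f - g, f - g \rangle}$.

By Lemma 12.1, all Cauchy sequences in $V$ have a pointwise limit. Let $\mathcal{H}$ be the set
obtained by adding to $V$ all the functions $g$ that are pointwise limits of Cauchy sequences
with respect to this norm. For any $f, g \in \mathcal{H}$, define

$$\langle f, g \rangle_{\mathcal{H}} = \lim_{n,m \to \infty} \langle f_m, g_n \rangle \qquad \text{and} \qquad \|f\|_{\mathcal{H}} = \lim_{m \to \infty} \|f_m\|,$$

where $f_1, f_2, \dots$ and $g_1, g_2, \dots$ are Cauchy sequences in $V$ with pointwise limits $f$ and $g$,
respectively. It is easy to check that $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ is well defined (i.e., independent of the choice
of the sequences $f_m$ and $g_n$ converging pointwise to $f$ and $g$) and that it is an inner product
in $\mathcal{H}$. It is also easy to see that $\mathcal{H}$ is a complete space (with respect to $\|\cdot\|_{\mathcal{H}}$) in which $V$ is
dense. Hence $\mathcal{H}$ is a Hilbert space.

To conclude the proof, we define the mapping $\phi : \mathbb{R}^d \to \mathcal{H}$ by $\phi(x) = K(x, \cdot)$. Then,
the reproducing property ensures that $K(u, v) = \langle \phi(u), \phi(v) \rangle$. □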


Note that the identity $\phi(x) = K(x, \cdot)$ provides a representation of the mapping $\phi$ directly
in terms of the kernel function $K$.
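
The characterization of Theorem 12.6 suggests a simple numerical sanity check: build the kernel matrix on a few points and evaluate the quadratic form $u^\top K u$. The sketch below does this in plain Python; the function names and the counterexample (negative squared distance, which is symmetric but not a kernel) are illustrative choices made here, not taken from the text. A nonnegative value on a handful of test vectors does not prove that a function is a kernel, whereas a single negative value does disprove it.

```python
def poly_kernel(u, v, degree=2):
    """Polynomial kernel K(u, v) = (1 + u . v)^degree."""
    return (1.0 + sum(a * b for a, b in zip(u, v))) ** degree


def neg_sq_dist(u, v):
    """Symmetric function -||u - v||^2; NOT a kernel (used as a counterexample)."""
    return -sum((a - b) ** 2 for a, b in zip(u, v))


def gram_matrix(points, kernel):
    """Kernel matrix with entries kernel(x_i, x_j)."""
    return [[kernel(p, q) for q in points] for p in points]


def quadratic_form(K, u):
    """Evaluate u^T K u for a square matrix K given as nested lists."""
    n = len(u)
    return sum(K[i][j] * u[i] * u[j] for i in range(n) for j in range(n))
```

For two distinct points, the matrix built from `neg_sq_dist` has zeros on the diagonal and a negative off-diagonal entry, so the vector $(1, 1)$ already gives a negative quadratic form.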


Remark 12.1. In the proof of Theorem 12.6 we obtain the same characterization when
$\mathbb{R}^d$ is replaced with an arbitrary set $S$. Hence, kernels may be more generally defined as
functions $K : S \times S \to \mathbb{R}$ where no assumptions are imposed on $S$ (e.g., $S$ can be a set of
combinatorial structures such as sequences, trees, or graphs). Since any kernel $K$ defines a
metric $d$ in $\mathcal{H}$ by

$$d(s, s') = \|\phi(s) - \phi(s')\| = \sqrt{K(s, s) + K(s', s') - 2K(s, s')},$$

we may view a kernel as a way to embed an arbitrary set of objects in a metric space.

We now state and prove Lemma 12.1, which we used in the proof of Theorem 12.6.

Lemma 12.1. For any sequence $(f_1, f_2, \dots)$ of elements of $V$, if

$$\lim_{n \to \infty} \sup_{m > n} \|f_m - f_n\| = 0$$

(i.e., the sequence is a Cauchy sequence), then $g = \lim_{n \to \infty} f_n$, defined by $g(x) =
\lim_{n \to \infty} f_n(x)$ for all $x \in \mathbb{R}^d$, exists.

Proof. Fix $x$ and consider the sequence $(f_1(x), f_2(x), \dots)$. Note that

$$|f_n(x) - f_m(x)| = |\langle f_n - f_m, K(x, \cdot) \rangle| \le \|f_n - f_m\| \sqrt{K(x, x)},$$

where we used the reproducing property in the first step and the Cauchy–Schwarz inequality
in the second. Thus, because $(f_1, f_2, \dots)$ is a Cauchy sequence, $(f_1(x), f_2(x), \dots)$ also is a Cauchy
sequence. By the Cauchy criterion, every such sequence on the reals has a limit. □


Kernels of the form $K(u, v) = (1 + u \cdot v)^k$ for $k \in \mathbb{N}$ are appropriately called polynomial
kernels. A closely related kernel is the homogeneous polynomial kernel $K(u, v) = (u \cdot v)^k$.
An infinite-dimensional extension of the homogeneous polynomial kernel is the exponential
kernel $K(u, v) = \exp(u \cdot v / \sigma^2)$ for $\sigma > 0$. The Taylor expansion

$$\exp(u \cdot v) = \sum_{k=0}^{\infty} \frac{(u \cdot v)^k}{k!}$$

reveals that exponential kernels are linear combinations of infinitely many homogeneous
polynomial kernels, where the coefficients of the polynomials decrease rapidly (as $1/k!$) with
the degree $k$. By enforcing $\|\phi(x)\| = 1$ or, equivalently, $K(x, x) = 1$, the exponential kernel
is transformed as follows:

$$\frac{K(u, v)}{\sqrt{K(u, u)\, K(v, v)}} = \frac{\exp(u \cdot v / \sigma^2)}{\sqrt{\exp(u \cdot u / \sigma^2) \exp(v \cdot v / \sigma^2)}} = \exp\bigl(-\|u - v\|^2 / (2\sigma^2)\bigr).$$

This is the gaussian kernel, widely used in pattern classification. The classifier constructed
by the Perceptron algorithm run with a gaussian kernel corresponds to a weighted mixture
of spherical gaussians with equal variance and centered on a subset of the previously seen
instances. Linear classifiers in the feature space defined by gaussian kernels have often
been called radial basis function (RBF) networks.
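
The normalization step can be verified numerically. In the sketch below, the function names and the default $\sigma = 1$ are illustrative assumptions; the normalized exponential kernel and the gaussian kernel should agree up to rounding error.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def exp_kernel(u, v, sigma=1.0):
    """Exponential kernel K(u, v) = exp(u . v / sigma^2)."""
    return math.exp(dot(u, v) / sigma**2)


def normalized_exp_kernel(u, v, sigma=1.0):
    """K(u, v) / sqrt(K(u, u) K(v, v)), which enforces K(x, x) = 1."""
    return exp_kernel(u, v, sigma) / math.sqrt(
        exp_kernel(u, u, sigma) * exp_kernel(v, v, sigma)
    )


def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian kernel exp(-||u - v||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma**2))
```

The agreement follows from $u \cdot v - \tfrac{1}{2}(\|u\|^2 + \|v\|^2) = -\tfrac{1}{2}\|u - v\|^2$.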


Mistake Bounds and Computational Issues 

The mistake bounds shown in Section 12.2 extend naturally to kernels. Consider, for
instance, the second-order Perceptron run with a generic kernel function $K$ in a reproducing
kernel Hilbert space $\mathcal{H}$. Pick any sequence $(x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^d \times \{-1, 1\}$ and let
the cumulative hinge loss of any function $f \in \mathcal{H}$ on this sequence be defined by

$$L_{\gamma,n}(f) = \sum_{t=1}^{n} \bigl(\gamma - y_t f(x_t)\bigr)_+ .$$

Then the number of mistakes made by the second-order Perceptron is bounded as

$$\sum_{t=1}^{n} I_{\{\hat y_t \ne y_t\}} \le \frac{L_{\gamma,n}(f)}{\gamma} + \frac{1}{\gamma} \sqrt{\Bigl(\|f\|_{\mathcal{H}}^2 + \sum_{t=1}^{n} f(x_t)^2\Bigr) \sum_{i=1}^{n} \ln(1 + \lambda_i)},$$

where the numbers $\lambda_i$ are the eigenvalues of the kernel matrix with entries $K(x_i, x_j)$ for
$i, j = 1, \dots, n$.

Note that if a linear kernel $K(x_i, x_j) = x_i^\top x_j$ is used, so that $f(x) = u^\top x$ for some
$u \in \mathbb{R}^d$, then the mistake bound of Theorem 12.3 (for the choice $\|u\| = 1$) is recovered
exactly. To see this let $A = x_1 x_1^\top + \cdots + x_n x_n^\top$ and observe that

$$u^\top A u = \sum_{t=1}^{n} (u^\top x_t)^2.$$

Also, the nonzero eigenvalues of the matrix $A$ coincide with the nonzero eigenvalues of the
kernel matrix.

We close this section by noting that the kernel-based version of the Perceptron uses
space $\Theta(m)$ to store a linear classifier and time $\Theta(m)$ to update it, where $m$ is the number of
mistakes made so far. The second-order Perceptron, instead, uses space $\Theta(m^2)$ for storing
and time $\Theta(m^2)$ for updating (this can be shown via simple linear algebraic identities
about the update of inverse matrices). Thus, kernel-based forecasters have space and time
requirements that grow with the number of mistakes. An interesting thread of research is the
design of principled techniques that trade off a reduction of space requirements against
a moderate increase in the number of mistakes. The label efficient analysis in Section 12.4
is an example of this approach.
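
The growth of the stored classifier with the number of mistakes is visible in a direct implementation of the kernel Perceptron. The following is a minimal sketch: the function names, the list-of-pairs representation of the classifier, and the convention $\mathrm{sgn}(0) = +1$ are assumptions made for illustration. Each mistake appends one instance to the stored list, so both the memory used and the cost of a prediction grow linearly with the number of mistakes.

```python
def kernel_predict(support, kernel, x):
    """Sign of sum_s y_s K(x_s, x) over the stored mistaken instances.

    Assumption: sgn(0) is taken to be +1.
    """
    p_hat = sum(y_s * kernel(x_s, x) for x_s, y_s in support)
    return 1 if p_hat >= 0 else -1


def kernel_perceptron(examples, kernel):
    """Run the kernel Perceptron once over `examples`, a list of (x, y) pairs.
    Returns (support, mistakes), where `support` lists the mistaken examples."""
    support = []
    mistakes = 0
    for x, y in examples:
        if kernel_predict(support, kernel, x) != y:
            support.append((x, y))   # one stored instance per mistake
            mistakes += 1
    return support, mistakes


def poly2_kernel(u, v):
    """Degree-2 polynomial kernel (1 + u . v)^2."""
    return (1.0 + sum(a * b for a, b in zip(u, v))) ** 2
```

As a usage example, the four "XOR" points $(\pm 1, \pm 1)$, labeled by the sign of the product of their coordinates, are not linearly separable in the plane, yet cycling over them a few times with `poly2_kernel` yields a classifier that labels all four correctly.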


12.6 Bibliographic Remarks 


Perceptrons, introduced by Rosenblatt [249] as an attempt to model “the capability of higher 
organisms for perceptual recognition, generalization, recall, and thinking,” are among the 
earliest examples of learning algorithms. Versions of the Perceptron convergence theorem 
were proved by Rosenblatt [250], Block [31], and Novikoff [225]. p-Norm Perceptrons 
were introduced and analyzed in the linearly separable case by Grove, Littlestone, and 
Schuurmans [133], as a special case of their quasi-additive classification algorithm (see also 
Warmuth and Jagota [305] and Kivinen and Warmuth [183]). Generalization of this analysis 
to sequences that are not linearly separable was proposed by Freund and Schapire [114], 
Gentile and Warmuth [125], and Gentile [124]. Perceptrons with dynamic tuning were 
considered by Graepel, Herbrich, and Williamson [132]. 

The Winnow algorithm was introduced by Littlestone [200] as an alternative to Percep- 
tron. Just as the Perceptron algorithm is the counterpart for classification of the Widrow— 
Hoff rule used in regression, the version of Winnow presented here is the classification 
version of the exponentiated gradient algorithm of Kivinen and Warmuth (see the biblio- 
graphic remarks in Chapter 11). Recalling the discussion at the end of Section 11.4, we 
may conclude that Winnow should perform better than Perceptron on data sequences that 
have dense instance vectors and are well approximated by sparse linear experts. In fact, 
Winnow was originally proposed for boolean side information, $x_t \in \{0, 1\}^d$, and for an
expert class properly contained in the class of linear experts: the class of all monotone
k-literal disjunction experts. Each such expert is defined by a subset of at most k coordinates,
and its prediction on $x_t \in \{0, 1\}^d$ is 1 if and only if these k coordinates have value
1 in $x_t$. As shown by Littlestone [200], if the data sequence is perfectly classified by some
k-literal disjunction expert, then Winnow makes at most $O(k \ln d)$ mistakes. On the other
hand, Kivinen, Warmuth, and Auer [184] show that there are boolean data sequences of the
same type on which the Perceptron algorithm makes $\Omega(kd)$ mistakes. For extensions and
applications of the p-norm Perceptron to classification of k-literal disjunctions, see also
Auer and Warmuth [15], Gentile [124], Littlestone [201].

The second-order Perceptron was introduced by Cesa-Bianchi, Conconi, and Gentile [47],
who also studied variants using the pseudoinverse of $x_1 x_1^\top + \cdots + x_t x_t^\top$ rather
than the inverse of $I + x_1 x_1^\top + \cdots + x_t x_t^\top$.

Forecasting strategies converging to a separating hyperplane with maximum margin
have been proposed by several authors. A remarkable example is the Adatron of Anlauf
and Biehl [8]. However, finite-time convergence results for approximate maximum margin
hyperplanes, such as the analysis of ALMA in Section 12.3, have been proposed only recently.
Such results include the relaxed maximum margin online algorithm of Li and Long [199]
and the margin infused relaxed algorithm of Crammer and Singer [75]. The ALMA algorithm
and Theorem 12.4 are due to Gentile [123]. Support vector machines (SVMs), an effective
classification technique originally introduced by Vapnik and Lerner [294] (under a different
name), find the maximum margin hyperplane at once by solving an optimization problem
defined over the entire sequence of examples. In their modern form, SVMs were introduced
by Boser, Guyon, and Vapnik [38] and Cortes and Vapnik [67]. See the monographs
Cristianini and Shawe-Taylor [76], Schölkopf and Smola [262], and Vapnik [292] for
extensive accounts on the theory of SVMs.

The label efficient forecasters presented in Section 12.4 were introduced and analyzed 
by Cesa-Bianchi, Gentile, and Zaniboni [49]. However, similar techniques aimed at saving 
labels have been extensively studied in pattern recognition. See, for instance, the pioneering 
paper of Cohn, Atlas, and Ladner [66], the query by committee algorithm of Freund, Seung, 
Shamir, and Tishby [116], and the more recent approaches of Campbell, Cristianini, and 
Smola [45], Tong and Koller [289], and Bordes, Ertekin, Weston, and Bottou [35]. 

The study of reproducing kernel Hilbert spaces was developed by Aronszajn [9] in the
1940s. The use of kernels was introduced in learning in 1964 with the influential
work of Aizerman, Braverman, and Rozonoer [1–3] and Bashkirov, Braverman, and
Muchnik [23] (see also Specht [277]).

However, it took almost 30 years before the potential of kernels began to be fully
understood with the paper of Boser, Guyon, and Vapnik [38]. The books of Schölkopf and
Smola [262] and Cristianini and Shawe-Taylor [77] are two excellent monographs on learning
with kernels. The proof of Theorem 12.6 is taken from Saitoh [255]. Kernel Perceptrons
were considered by Freund and Schapire [114]. The kernel second-order Perceptron is due
to Cesa-Bianchi, Conconi, and Gentile [47].


12.7 Exercises 


12.1 (Perceptron with time-varying learning rate) Extend Theorem 12.1 to prove that the Perceptron
(i.e., the p-norm Perceptron with $p = 2$) with learning rate $\lambda_t = 1/\|x_t\|$ achieves, on any
sequence $(x_1, y_1), (x_2, y_2), \dots \in \mathbb{R}^d \times \{-1, 1\}$, and for all $\gamma > 0$ and $u \in \mathbb{R}^d$, the bound


7 L0) luy luly? 2,0) 
X Iian = Z + (=) + hae tan a 
t=1 


4 Y Y y 

where L, n0) =y (y — y,u-x,/ IIxil) , is the normalized cumulative hinge loss. 

12.2 (p-Norm Perceptron for the absolute loss) By adapting the self-confident linear forecaster
with polynomial potential introduced in Section 11.5, derive a forecaster for the absolute loss
$\ell(\hat p, y) = \frac{1}{2}|\hat p - y|$, where $\hat p \in [-1, +1]$ and $y \in \{-1, 1\}$. Prove a bound on the absolute loss
of this forecaster in terms of the hinge loss $L_{\gamma,n}(u)$ of the best linear forecaster $u$ with q-norm
bounded by a known constant. Set the hinge $\gamma$ to 1 (Auer, Cesa-Bianchi, and Gentile [13]).
Warning: This exercise is difficult.

12.3 (Learning r-of-k threshold functions) An r-of-k threshold function is a function
$f : \{0, 1\}^d \to \{-1, 1\}$ specified by k relevant attributes indexed by
$i_1, \ldots, i_k \in \{1, \ldots, d\}$. On any $\mathbf{x} \in \{0, 1\}^d$, $f(\mathbf{x}) = 1$ if and only if
$x_{i_1} + \cdots + x_{i_k} \ge r$. Given a sequence
$(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n) \in \{0, 1\}^d \times \{-1, 1\}$, the attribute error $A_f$ on the
sequence is $a_{f,1} + \cdots + a_{f,n}$, where $a_{f,t}$ is the minimum number of components of
$\mathbf{x}_t$ that have to be changed to ensure that $f(\mathbf{x}_t) = y_t$.

Prove a mistake bound for the p-norm Perceptron on sequences over $\{0, 1\}^d \times \{-1, 1\}$
such that the number of mistakes is bounded in terms of the attribute error $A_f$ of an arbitrary
r-of-k threshold function f. Investigate what happens to the bound when p is set to $2 \ln(d + 1)$
(see Gentile [124]).
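
The definitions in this exercise can be made concrete with a small sketch. The function names below (`r_of_k`, `attribute_error`, `sequence_attribute_error`) are illustrative and not from the text, and the code assumes the convention that f(x) = 1 when at least r relevant attributes are on, with 1 <= r <= k.

```python
from typing import Sequence, Tuple


def r_of_k(x: Sequence[int], relevant: Sequence[int], r: int) -> int:
    """Return +1 if at least r of the relevant attributes of x equal 1, else -1."""
    return 1 if sum(x[i] for i in relevant) >= r else -1


def attribute_error(x: Sequence[int], y: int, relevant: Sequence[int], r: int) -> int:
    """Minimum number of components of x to flip so that f(x) == y.

    Only relevant attributes matter: to reach label +1 we switch on
    r - s of them, to reach label -1 we switch off s - r + 1 of them,
    where s is the current number of relevant attributes that are on.
    """
    s = sum(x[i] for i in relevant)
    if y == 1:
        return max(0, r - s)
    return max(0, s - r + 1)


def sequence_attribute_error(
    data: Sequence[Tuple[Sequence[int], int]], relevant: Sequence[int], r: int
) -> int:
    """Attribute error A_f of the whole sequence: the sum of per-example errors."""
    return sum(attribute_error(x, y, relevant, r) for x, y in data)
```

An example is counted with error 0 exactly when the threshold function already agrees with its label, so a sequence that f classifies perfectly has attribute error 0.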

12.4 (Parameterized second-order Perceptron) Consider the parameterized second-order
forecaster defined using

\[
A_t = a I + \sum_{s=1}^{t-1} \mathbf{x}_s \mathbf{x}_s^{\top},
\]

where a > 0 is a free parameter. Hence, the parameterless second-order Perceptron
corresponds to the setting a = 1. Prove an analog of Theorem 12.3 for this variant. Investigate
different choices of the parameter a. What happens to the mistake bound for $a \to \infty$?

12.5 (Second-order vs. classical Perceptron) Show that there exists a choice of a such that
inequality (12.4) is satisfied when a < 1/(2k), where k is the number of nonzero eigenvalues
of $A_n$ (Cesa-Bianchi, Conconi, and Gentile [47]).

12.6 (Proofs via the Blackwell condition) Consider the conservative classifiers introduced in
Section 12.2. Observe that the weight vectors used by these classifiers can be equivalently
defined using $\mathbf{w}_t = \nabla\Phi(\mathbf{R}_t)$, where $\mathbf{R}_t$ is the cumulative “regret”

\[
\mathbf{R}_t = \sum_{s=1}^{t} \mathbf{r}_s = -\sum_{s=1}^{t} \nabla \ell_{\gamma,s}(\mathbf{w}_{s-1})
\]

and $\ell_{\gamma,s}(\mathbf{w}_{s-1})$ is the hinge loss $(\gamma - y_s\, \mathbf{w}_{s-1} \cdot \mathbf{x}_s)_+$. Verify that the Blackwell condition

\[
\sup_{y_t \in \{-1, 1\}} \mathbf{r}_t \cdot \nabla\Phi(\mathbf{R}_{t-1}) \le 0
\]

holds for this definition of regret. Then use Corollary 2.1 to derive the same mistake
bound shown in Theorem 12.1. Hint: Use Corollary 2.1 to upper bound $\Phi(\mathbf{R}_n)$ in terms
of $\sum_t \mathbb{I}_{\{\hat{y}_t \ne y_t\}}$, and then use Hölder’s inequality to show the lower bound


IRI, = y DSUs — DS 40) 
t=1 t=1 


for $\mathbf{u} \in \mathbb{R}^d$ arbitrary (Cesa-Bianchi and Lugosi [54]).


12.7 (ALMA on arbitrary sequences) Suppose ALMA is run on an arbitrary sequence
$(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n), \ldots \in \mathbb{R}^d \times \{-1, 1\}$. Prove a bound on the number of updates of the form


PE L,(u) x as a c& |L, u) 
y: y Y Y 


for any $\mathbf{u} \in \mathbb{R}^d$ with $\|\mathbf{u}\| = 1$ and for any $\gamma > 0$. Note: The bound does not depend on $\alpha$,
but you might have to change the constants in the definition of $\gamma_t$ and $\eta_t$ in order to prove it
(Gentile [123]).

12.8 (p-Norm ALMA) Prove a version of Theorem 12.4 using a modified p-norm Perceptron
(Gentile [123]).

12.9 (Label efficient Winnow) Adapt the proof of Theorem 12.2 to show that the label efficient
version of Winnow, querying label $Y_t$ with probability $c/(c + |\hat{p}_t|)$, and run with
parameters $\eta = 2\alpha\gamma/X_\infty^2$ and $c = (1 - \alpha)\gamma$ for some $0 < \alpha < 1$, achieves an expected number of
mistakes satisfying

\[
\mathbb{E}\left[\sum_{t=1}^{n} \mathbb{I}_{\{\hat{y}_t \ne y_t\}}\right]
\le \frac{1}{1-\alpha}\, \frac{L_{\gamma,n}(\mathbf{u})}{\gamma}
+ \left(\frac{X_\infty}{\gamma}\right)^{2} \frac{\ln d}{2\alpha(1-\alpha)}
\]

for all $\mathbf{u} \in \mathbb{R}^d$ in the probability simplex (Cesa-Bianchi, Lugosi, and Stoltz [55] and
Cesa-Bianchi, Gentile, and Zaniboni [49]).


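12.10 (Label efficient second-order Perceptron) Adapt the proof of Theorem 12.3 to show that
the label efficient version of the second-order Perceptron, querying label $Y_t$ with probability
$c/(c + |\hat{p}_t|)$, achieves an expected number of mistakes satisfying

\[
\mathbb{E}\left[\sum_{t=1}^{n} \mathbb{I}_{\{\hat{y}_t \ne y_t\}}\right]
\le \frac{L_{\gamma,n}(\mathbf{u})}{\gamma}
+ \frac{c}{2\gamma^{2}} \left(\|\mathbf{u}\|^{2} + \mathbf{u}^{\top} A_n \mathbf{u}\right)
+ \frac{1}{2c} \sum_{i=1}^{d} \ln(1 + \lambda_i)
\]

for any choice c > 0 of the input parameter and for all $\mathbf{u} \in \mathbb{R}^d$ and $\gamma > 0$.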


12.11 (Second-order Perceptron in dual variables) Show that the second-order Perceptron
classification at time t can be computed using only inner product operations between instances
$\mathbf{x}_1, \ldots, \mathbf{x}_t$.

12.12 Show by induction that a kth-degree surface in $\mathbb{R}^d$ is specified by $\binom{d+k}{k}$ coefficients.


12.13 (All subsets kernel) Find an easily computable kernel for the mapping
$\phi : \mathbb{R}^d \to \mathbb{R}^{2^d}$ defined by $\phi(\mathbf{x}) = (x_A)_A$, where $x_A = \prod_{i \in A} x_i$ and A ranges over
all subsets of $\{1, \ldots, d\}$ (Takimoto and Warmuth [286]).

12.14 (ANOVA kernel) Consider the mapping $\phi$ such that, for any $\mathbf{x} \in \mathbb{R}^d$, $\phi(\mathbf{x}) = (x_A)_A$, where
$x_A = \prod_{i \in A} x_i$ and A ranges over all subsets of $\{1, \ldots, d\}$ of size at most k for some fixed
$k = 1, \ldots, d$. Direct computation of $\phi(\mathbf{u}) \cdot \phi(\mathbf{v})$ takes time of order $d^k$. Use dynamic programming
to show that the same computation can be performed in time O(kd) (Watkins [306]).
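
One possible dynamic program for the last exercise is sketched below; it is an illustration under stated assumptions rather than the intended solution. It assumes the empty subset is included (contributing the constant 1) and uses the fact that the inner product is a sum of elementary symmetric polynomials in the products $z_i = u_i v_i$. The function names are hypothetical.

```python
from itertools import combinations
from math import prod
from typing import Sequence


def anova_kernel(u: Sequence[float], v: Sequence[float], k: int) -> float:
    """Sum over all subsets A with |A| <= k of prod_{i in A} u_i * v_i, in O(k d) time.

    e[j] holds the elementary symmetric polynomial of degree j in the
    products z_i = u_i * v_i seen so far.
    """
    e = [1.0] + [0.0] * k
    for ui, vi in zip(u, v):
        z = ui * vi
        # Go downward so that e[j - 1] still refers to the previous coordinate set.
        for j in range(k, 0, -1):
            e[j] += z * e[j - 1]
    return sum(e)


def anova_kernel_bruteforce(u: Sequence[float], v: Sequence[float], k: int) -> float:
    """Direct evaluation of phi(u) . phi(v) by enumerating the subsets."""
    d = len(u)
    return sum(
        prod(u[i] * v[i] for i in subset)
        for size in range(k + 1)
        for subset in combinations(range(d), size)
    )
```

With k = d the same recursion returns the product of the factors $1 + u_i v_i$, which is one way to see the all subsets kernel of the preceding exercise.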


Appendix 


In this appendix we collect some of the technical tools used in the book and not proved in 
the main text. Most of the results reproduced here are quite standard; they are here to make 
the book as self-contained as possible. Here we take a minimalist approach and stick to the 
simplest possible versions that are necessary to follow the material in the main text. This 
appendix should not be taken as an attempt at an exhaustive survey. The cited references
are merely intended to point to the original sources of the results.


A.1 Inequalities from Probability Theory 


A.1.1  Hoeffding’s Inequality 
First we offer a proof of Lemma 2.2, which states the following: 


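Lemma A.1. Let X be a random variable with $a \le X \le b$. Then for any $s \in \mathbb{R}$,

\[
\ln \mathbb{E}\left[e^{sX}\right] \le s\, \mathbb{E}X + \frac{s^{2}(b-a)^{2}}{8}.
\]


Proof. Since $\ln \mathbb{E}[e^{sX}] \le s\, \mathbb{E}X + \ln \mathbb{E}[e^{s(X - \mathbb{E}X)}]$, it suffices to show that for any random
variable X with $\mathbb{E}X = 0$ and $a \le X \le b$,

\[
\mathbb{E}\left[e^{sX}\right] \le e^{s^{2}(b-a)^{2}/8}.
\]

Note that by convexity of the exponential function,

\[
e^{sx} \le \frac{x-a}{b-a}\, e^{sb} + \frac{b-x}{b-a}\, e^{sa} \qquad \text{for } a \le x \le b.
\]

Exploiting $\mathbb{E}X = 0$, and introducing the notation $p = -a/(b - a)$, we get

\[
\mathbb{E}\, e^{sX} \le \frac{b}{b-a}\, e^{sa} - \frac{a}{b-a}\, e^{sb}
= \left(1 - p + p e^{s(b-a)}\right) e^{-ps(b-a)}
\stackrel{\mathrm{def}}{=} e^{\phi(u)},
\]

where $u = s(b - a)$, and $\phi(u) = -pu + \log(1 - p + pe^{u})$. But by straightforward
calculation it is easy to see that the derivative of $\phi$ is

\[
\phi'(u) = -p + \frac{p}{p + (1-p)e^{-u}},
\]

and therefore $\phi(0) = \phi'(0) = 0$. Moreover,

\[
\phi''(u) = \frac{p(1-p)e^{-u}}{\left(p + (1-p)e^{-u}\right)^{2}} \le \frac{1}{4}.
\]

Thus, by Taylor’s theorem,

\[
\phi(u) = \phi(0) + u\phi'(0) + \frac{u^{2}}{2}\phi''(\theta) \le \frac{u^{2}}{8} = \frac{s^{2}(b-a)^{2}}{8}
\]

for some $\theta \in [0, u]$. ∎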
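
As a numerical sanity check of Lemma A.1 (not a proof), the sketch below evaluates both sides for two-point distributions supported on {a, b}, which are the extremal case used in the convexity step. The function names and the grid of parameters are illustrative choices.

```python
import math


def log_mgf_two_point(a: float, b: float, p: float, s: float) -> float:
    """ln E[exp(sX)] when P[X = b] = p and P[X = a] = 1 - p."""
    return math.log((1.0 - p) * math.exp(s * a) + p * math.exp(s * b))


def hoeffding_lemma_bound(a: float, b: float, mean: float, s: float) -> float:
    """Right-hand side of Lemma A.1: s * E[X] + s^2 (b - a)^2 / 8."""
    return s * mean + s * s * (b - a) ** 2 / 8.0


def largest_violation(a: float, b: float, steps: int = 40) -> float:
    """Largest value of (left side - right side) over a grid of p and s.

    The lemma predicts that this is never positive (up to rounding).
    """
    worst = float("-inf")
    for i in range(1, steps):
        p = i / steps
        mean = (1.0 - p) * a + p * b
        for j in range(-steps, steps + 1):
            s = 3.0 * j / steps
            gap = log_mgf_two_point(a, b, p, s) - hoeffding_lemma_bound(a, b, mean, s)
            worst = max(worst, gap)
    return worst
```

For the symmetric case a = -1, b = 1, p = 1/2 the lemma reduces to the familiar inequality $\ln\cosh s \le s^2/2$.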


Lemma A.1 was originally proven to derive the following result, also known as Hoeffd- 
ing’s inequality. 


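Corollary A.1. Let $X_1, \ldots, X_n$ be independent real-valued random variables such that for
each $i = 1, \ldots, n$ there exist some $a_i \le b_i$ such that $\mathbb{P}[a_i \le X_i \le b_i] = 1$. Then for every
$\varepsilon > 0$,

\[
\mathbb{P}\left[\sum_{i=1}^{n} X_i - \mathbb{E}\sum_{i=1}^{n} X_i > \varepsilon\right]
\le \exp\left(-\frac{2\varepsilon^{2}}{\sum_{i=1}^{n}(b_i - a_i)^{2}}\right)
\]

and

\[
\mathbb{P}\left[\sum_{i=1}^{n} X_i - \mathbb{E}\sum_{i=1}^{n} X_i < -\varepsilon\right]
\le \exp\left(-\frac{2\varepsilon^{2}}{\sum_{i=1}^{n}(b_i - a_i)^{2}}\right).
\]


Proof. The proof is based on a clever application of Markov’s inequality, often referred
to as Chernoff’s technique: for any s > 0,

\[
\mathbb{P}\left[\sum_{i=1}^{n}(X_i - \mathbb{E}X_i) > \varepsilon\right]
\le \frac{\mathbb{E}\left[\exp\left(s\sum_{i=1}^{n}(X_i - \mathbb{E}X_i)\right)\right]}{\exp(s\varepsilon)}
= \frac{\prod_{i=1}^{n}\mathbb{E}\left[\exp\left(s(X_i - \mathbb{E}X_i)\right)\right]}{\exp(s\varepsilon)},
\]

where we used independence of the variables $X_i$. Bound the numerator using Lemma A.1
and minimize the obtained bound in s to get the first inequality. The second is obtained by
symmetry. ∎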
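
To see how the bound of Corollary A.1 compares with an exact tail, the following sketch treats the special case of i.i.d. Bernoulli(p) variables, so that $b_i - a_i = 1$ and the sum is binomial. The helper names are hypothetical, and the tail is computed with a non-strict inequality, which the Chernoff argument also covers.

```python
import math


def binomial_deviation_tail(n: int, p: float, eps: float) -> float:
    """Exact P[S - n p >= eps] for S ~ Binomial(n, p)."""
    return sum(
        math.comb(n, j) * p**j * (1.0 - p) ** (n - j)
        for j in range(n + 1)
        if j - n * p >= eps
    )


def hoeffding_bound(n: int, eps: float) -> float:
    """exp(-2 eps^2 / sum_i (b_i - a_i)^2) with b_i - a_i = 1 for every i."""
    return math.exp(-2.0 * eps**2 / n)
```

The bound ignores the variance of the summands, so it is typically loose when p is far from 1/2; this is the situation addressed by the inequalities of Section A.1.2.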


We close this section with a version of Corollary A.1, also due to Hoeffding [161], for
the case when sampling is done without replacement.

Lemma A.2. Let the set A consist of N numbers $a_1, \ldots, a_N$. Let $Z_1, \ldots, Z_n$ denote a
random sample taken without replacement from A, where $n \le N$. Denote

\[
m = \frac{1}{N}\sum_{i=1}^{N} a_i \qquad \text{and} \qquad c = \max_{i,j \le N} |a_i - a_j|.
\]

Then for any $\varepsilon > 0$ we have

\[
\mathbb{P}\left[\left|\frac{1}{n}\sum_{i=1}^{n} Z_i - m\right| \ge \varepsilon\right] \le 2e^{-2n\varepsilon^{2}/c^{2}}.
\]

For more inequalities of this type, see Hoeffding [161] and Serfling [264].

A.1.2 Bernstein’s Inequality 
Next we present inequalities that, in certain situations, give tighter bounds than Hoeffding’s 
inequality. The first result is a simple “poissonian” inequality. 


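Lemma A.3. Let X be a random variable taking values in [0, 1]. Then, for any $s \in \mathbb{R}$,

\[
\ln \mathbb{E}\left[e^{sX}\right] \le (e^{s} - 1)\, \mathbb{E}X.
\]


Proof. As in the proof of Hoeffding’s inequality, we exploit the convexity of $e^{sx}$ by
observing that for any $x \in [0, 1]$, $e^{sx} \le xe^{s} + (1 - x)$. Thus,

\[
\mathbb{E}\left[e^{sX}\right] \le \mathbb{E}X\, e^{s} + 1 - \mathbb{E}X.
\]

By the elementary inequality $1 + x \le e^{x}$ we have $\mathbb{E}X\, e^{s} + 1 - \mathbb{E}X \le e^{(e^{s}-1)\mathbb{E}X}$, as
desired. ∎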


The next inequality is a version of Bernstein’s inequality [25]; see also Freedman [110], 
Neveu [224]. 


Lemma A.4. Let X be a zero-mean random variable taking values in $(-\infty, 1]$ with variance
$\mathbb{E}X^{2} = \sigma^{2}$. Then, for any $\eta > 0$,

\[
\ln \mathbb{E}\, e^{\eta X} \le \sigma^{2}\left(e^{\eta} - 1 - \eta\right).
\]


Proof. The key observation is that the function $(e^{x} - x - 1)/x^{2}$ is nondecreasing for all
$x \in \mathbb{R}$. But then, since $X \le 1$,

\[
e^{\eta X} - \eta X - 1 \le X^{2}\left(e^{\eta} - \eta - 1\right).
\]

Taking expected values on both sides, taking logarithms, and using $\ln(1 + x) \le x$, we
obtain the stated result. ∎


A simple consequence of Lemma A.4 is the following inequality. 


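Lemma A.5. Let X be a random variable taking values in [0, 1]. Let $\sigma = \sqrt{\mathbb{E}X^{2} - (\mathbb{E}X)^{2}}$.
Then for any $\eta > 0$,

\[
\ln \mathbb{E}\left[e^{\eta(X - \mathbb{E}X)}\right] \le \sigma^{2}\left(e^{\eta} - 1 - \eta\right) \le \mathbb{E}X(1 - \mathbb{E}X)\left(e^{\eta} - 1 - \eta\right).
\]

Proof. The first inequality is a direct consequence of Lemma A.4. The second inequality
follows by noting that, since $X \in [0, 1]$,

\[
\sigma^{2} = \mathbb{E}X^{2} - (\mathbb{E}X)^{2} \le \mathbb{E}X - (\mathbb{E}X)^{2} = \mathbb{E}X(1 - \mathbb{E}X). \qquad ∎
\]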


Using Lemma A.4 together with Chernoff’s technique as in the proof of Corollary A.1, it 
now is easy to deduce the following result. 


Corollary A.2 (Bennett’s inequality). Let $X_1, \ldots, X_n$ be independent real-valued random
variables with zero mean, and assume that $X_i \le 1$ with probability 1. Let

\[
\sigma^{2} = \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}X_i^{2}.
\]

Then for any t > 0,

\[
\mathbb{P}\left[\sum_{i=1}^{n} X_i > t\right] \le \exp\left(-n\sigma^{2}\, h\!\left(\frac{t}{n\sigma^{2}}\right)\right),
\]

where $h(u) = (1 + u)\log(1 + u) - u$ for $u \ge 0$.


The message of this inequality is perhaps best seen if we do some further bounding.
Applying the elementary inequality $h(u) \ge u^{2}/(2 + 2u/3)$, $u \ge 0$ (which may be seen by
comparing the derivatives of both sides), we obtain a classical inequality of Bernstein [25].


Corollary A.3 (Bernstein’s inequality). Under the conditions of the previous corollary, for
any $\varepsilon > 0$,

\[
\mathbb{P}\left[\frac{1}{n}\sum_{i=1}^{n} X_i > \varepsilon\right] \le \exp\left(-\frac{n\varepsilon^{2}}{2\sigma^{2} + 2\varepsilon/3}\right).
\]
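
The passage from Bennett's to Bernstein's inequality rests on the elementary inequality $h(u) \ge u^2/(2 + 2u/3)$. A minimal numerical sketch of that step follows; the function names are illustrative, and the check is a grid evaluation rather than a proof.

```python
import math


def h(u: float) -> float:
    """Bennett's function h(u) = (1 + u) log(1 + u) - u for u >= 0."""
    return (1.0 + u) * math.log1p(u) - u


def bennett_bound(n: int, sigma2: float, t: float) -> float:
    """Right-hand side of Corollary A.2 for the event {sum_i X_i > t}."""
    return math.exp(-n * sigma2 * h(t / (n * sigma2)))


def bernstein_bound(n: int, sigma2: float, eps: float) -> float:
    """Right-hand side of Corollary A.3 for the event {(1/n) sum_i X_i > eps}."""
    return math.exp(-n * eps**2 / (2.0 * sigma2 + 2.0 * eps / 3.0))
```

Taking t = n * eps makes the two events coincide, so the Bennett bound is never larger than the Bernstein bound for the same n, variance, and deviation.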


A.1.3 Hoeffding–Azuma Inequality and Related Results
The following extension of Hoeffding’s inequality to bounded martingale difference
sequences is simple and useful.

A sequence of random variables $V_1, V_2, \ldots$ is a martingale difference sequence with
respect to the sequence of random variables $X_1, X_2, \ldots$ if, for every i > 0, $V_i$ is a function
of $X_1, \ldots, X_i$, and

\[
\mathbb{E}\left[V_{i+1} \mid X_1, \ldots, X_i\right] = 0 \qquad \text{with probability 1.}
\]


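Lemma A.6. Let $V_1, V_2, \ldots$ be a martingale difference sequence with respect to some
sequence $X_1, X_2, \ldots$ such that $V_i \in [A_i, A_i + c_i]$ for some random variable $A_i$, measurable
with respect to $X_1, \ldots, X_{i-1}$, and a positive constant $c_i$. If $S_k = \sum_{i=1}^{k} V_i$, then for any
s > 0,

\[
\mathbb{E}\left[e^{sS_k}\right] \le e^{(s^{2}/8)\sum_{i=1}^{k} c_i^{2}}.
\]


Proof.

\[
\mathbb{E}\left[e^{sS_k}\right]
= \mathbb{E}\left[e^{sS_{k-1}}\, \mathbb{E}\left[e^{sV_k} \mid X_1, \ldots, X_{k-1}\right]\right]
\le \mathbb{E}\left[e^{sS_{k-1}}\, e^{s^{2}c_k^{2}/8}\right]
= e^{s^{2}c_k^{2}/8}\, \mathbb{E}\left[e^{sS_{k-1}}\right],
\]

where we applied Lemma A.1. The desired inequality is obtained by iterating the
argument. ∎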


Just as in the case of Corollary A.1, we obtain the following corollary. 


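Lemma A.7. Let $V_1, V_2, \ldots$ be a martingale difference sequence with respect to some
sequence $X_1, X_2, \ldots$ such that $V_i \in [A_i, A_i + c_i]$ for some random variable $A_i$, measurable
with respect to $X_1, \ldots, X_{i-1}$, and a positive constant $c_i$. If $S_n = \sum_{i=1}^{n} V_i$, then for any
t > 0,

\[
\mathbb{P}[S_n > t] \le \exp\left(\frac{-2t^{2}}{\sum_{i=1}^{n} c_i^{2}}\right)
\]

and

\[
\mathbb{P}[S_n < -t] \le \exp\left(\frac{-2t^{2}}{\sum_{i=1}^{n} c_i^{2}}\right).
\]

In fact, as noted in [161], the following “maximal” version of Lemma A.7 also holds:

\[
\mathbb{P}\left[\max_{i \le n} S_i > t\right] \le \exp\left(\frac{-2t^{2}}{\sum_{i=1}^{n} c_i^{2}}\right).
\]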


We also need the following “Bernstein-like” improvement that takes variance information 
into account (see Freedman [110]). The proof, which we omit, is an extension of the 
independent case just shown. 


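Lemma A.8 (Bernstein’s inequality for martingales). Let $X_1, \ldots, X_n$ be a bounded
martingale difference sequence with respect to the filtration $\mathcal{F} = (\mathcal{F}_i)_{1 \le i \le n}$ and with $|X_i| \le K$.
Let

\[
S_i = \sum_{j=1}^{i} X_j
\]

be the associated martingale. Denote the sum of the conditional variances by

\[
\Sigma_n^{2} = \sum_{t=1}^{n} \mathbb{E}\left[X_t^{2} \mid \mathcal{F}_{t-1}\right].
\]

Then for all constants t, v > 0,

\[
\mathbb{P}\left[\max_{i=1,\ldots,n} S_i > t \ \text{ and } \ \Sigma_n^{2} \le v\right] \le \exp\left(-\frac{t^{2}}{2(v + Kt/3)}\right),
\]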
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and therefore, 


aes 


A.1.4 Khinchine’s Inequality 
Recall Lemma 8.2: 


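Lemma A.9. Let $a_1, \ldots, a_n$ be real numbers, and let $\sigma_1, \ldots, \sigma_n$ be i.i.d. sign variables
with $\mathbb{P}[\sigma_1 = 1] = \mathbb{P}[\sigma_1 = -1] = 1/2$. Then

\[
\mathbb{E}\left|\sum_{i=1}^{n} a_i\sigma_i\right| \ge \frac{1}{\sqrt{2}}\sqrt{\sum_{i=1}^{n} a_i^{2}}.
\]

Here we give a short and elegant proof (with a suboptimal constant $1/\sqrt{3}$ instead of $1/\sqrt{2}$)
due to Littlewood [204]. First note that for any random variable X with finite fourth moment,

\[
\mathbb{E}|X| \ge \frac{\left(\mathbb{E}X^{2}\right)^{3/2}}{\left(\mathbb{E}X^{4}\right)^{1/2}}.
\]

Indeed, by Hölder’s inequality,

\[
\mathbb{E}X^{2} = \mathbb{E}\left[|X|^{4/3}|X|^{2/3}\right] \le \left(\mathbb{E}X^{4}\right)^{1/3}\left(\mathbb{E}|X|\right)^{2/3}.
\]

Applying this inequality for $X = \sum_{i=1}^{n} a_i\sigma_i$ gives

\[
\mathbb{E}\left|\sum_{i=1}^{n} a_i\sigma_i\right|
\ge \frac{\left(\sum_{i=1}^{n} a_i^{2}\right)^{3/2}}{\sqrt{\sum_{i=1}^{n} a_i^{4} + 3\sum_{i \ne j} a_i^{2}a_j^{2}}}
\ge \frac{1}{\sqrt{3}}\sqrt{\sum_{i=1}^{n} a_i^{2}},
\]

where we used $\sum_{i=1}^{n} a_i^{4} + 3\sum_{i \ne j} a_i^{2}a_j^{2} \le 3\left(\sum_{i=1}^{n} a_i^{2}\right)^{2}$.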
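
For small n the expectation in Khinchine's inequality can be computed exactly by enumerating all sign patterns, which makes the chain of bounds easy to inspect. The sketch below is illustrative only; its function names are not from the text, and the fourth moment uses the identity $\mathbb{E}X^4 = 3(\sum_i a_i^2)^2 - 2\sum_i a_i^4$, which is a rewriting of the expression used in the proof.

```python
from itertools import product
from math import sqrt
from typing import Sequence


def expected_abs_rademacher_sum(a: Sequence[float]) -> float:
    """Exact E|sum_i a_i sigma_i| over uniformly random signs sigma_i."""
    n = len(a)
    total = sum(
        abs(sum(s * x for s, x in zip(signs, a)))
        for signs in product((-1, 1), repeat=n)
    )
    return total / 2**n


def littlewood_lower_bound(a: Sequence[float]) -> float:
    """(E X^2)^{3/2} / (E X^4)^{1/2} for X = sum_i a_i sigma_i."""
    second = sum(x * x for x in a)
    fourth = 3.0 * second**2 - 2.0 * sum(x**4 for x in a)
    return second**1.5 / sqrt(fourth)
```

The vector a = (1, 1) gives $\mathbb{E}|X| = 1$ and $\sqrt{a_1^2 + a_2^2} = \sqrt{2}$, so the constant $1/\sqrt{2}$ in Lemma A.9 cannot be improved.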


A.1.5 Slud’s Inequality 
Here we recall, without proof, an inequality due to Slud [272] between binomial tails and 
their approximating normals. 


Lemma A.10. Let B be a binomial (n, p) random variable with $p \le 1/2$. Then for
$n(1 - p) \ge k \ge np$,

\[
\mathbb{P}[B \ge k] \ge \mathbb{P}\left[N \ge \frac{k - np}{\sqrt{np(1-p)}}\right],
\]

where N is a standard normal random variable.
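
A small numerical illustration of Slud's inequality is sketched below, comparing an exact binomial tail with the corresponding normal tail. The helper names are hypothetical; the normal tail is expressed through the complementary error function.

```python
import math


def binomial_upper_tail(n: int, p: float, k: int) -> float:
    """Exact P[B >= k] for B ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p**j * (1.0 - p) ** (n - j) for j in range(k, n + 1))


def normal_upper_tail(z: float) -> float:
    """P[N >= z] for a standard normal N."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def slud_lower_bound(n: int, p: float, k: int) -> float:
    """Right-hand side of Lemma A.10; meaningful for p <= 1/2 and np <= k <= n(1 - p)."""
    return normal_upper_tail((k - n * p) / math.sqrt(n * p * (1.0 - p)))
```

With n = 30 and p = 0.3 the admissible range is k = 9, ..., 21, and the binomial tail dominates the normal tail throughout it.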
A.1.6 A Simple Limit Theorem 


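Lemma A.11. Let $\{Z_{i,t}\}$ be i.i.d. Rademacher random variables ($i = 1, \ldots, N$; $t = 1, 2, \ldots$)
with distribution $\mathbb{P}[Z_{i,t} = -1] = \mathbb{P}[Z_{i,t} = 1] = 1/2$, and let $G_1, \ldots, G_N$ be
independent standard normal random variables. Then

\[
\lim_{n \to \infty} \mathbb{E}\left[\max_{i=1,\ldots,N} \frac{1}{\sqrt{n}}\sum_{t=1}^{n} Z_{i,t}\right] = \mathbb{E}\left[\max_{i=1,\ldots,N} G_i\right].
\]


Proof. Define the N-vector $\mathbf{X}_n = (X_{n,1}, \ldots, X_{n,N})$ of components

\[
X_{n,i} \stackrel{\mathrm{def}}{=} \frac{1}{\sqrt{n}}\sum_{t=1}^{n} Z_{i,t}, \qquad i = 1, \ldots, N.
\]

By the “Cramér–Wold device” (see, e.g., Billingsley [27, p. 48]), the sequence of vectors
$\{\mathbf{X}_n\}$ converges in distribution to a vector random variable $\mathbf{G} = (G_1, \ldots, G_N)$ if and
only if $\sum_{i=1}^{N} a_i X_{n,i}$ converges in distribution to $\sum_{i=1}^{N} a_i G_i$ for all possible choices of the
coefficients $a_1, \ldots, a_N$. Now clearly, $\sum_{i=1}^{N} a_i X_{n,i}$ converges in distribution, as $n \to \infty$, to
a zero-mean normal random variable with variance $\sum_{i=1}^{N} a_i^{2}$. Then, by the Cramér–Wold
device, as $n \to \infty$ the vector $\mathbf{X}_n$ converges in distribution to $\mathbf{G} = (G_1, \ldots, G_N)$, where
$G_1, \ldots, G_N$ are independent standard normal random variables.

Convergence in distribution is equivalent to the fact that for any bounded continuous
function $\psi : \mathbb{R}^{N} \to \mathbb{R}$,

\[
\lim_{n \to \infty} \mathbb{E}\left[\psi(X_{n,1}, \ldots, X_{n,N})\right] = \mathbb{E}\left[\psi(G_1, \ldots, G_N)\right]. \tag{A.1}
\]

Consider, in particular, the function $\psi(x_1, \ldots, x_N) = \phi_L(\max_i x_i)$, where L > 0, and $\phi_L$ is
the “thresholding” function

\[
\phi_L(x) =
\begin{cases}
-L & \text{if } x < -L, \\
x & \text{if } |x| \le L, \\
L & \text{if } x > L.
\end{cases}
\]

Clearly, $\phi_L$ is bounded and continuous. Hence, by (A.1), we conclude that

\[
\lim_{n \to \infty} \mathbb{E}\left[\phi_L\left(\max_{i=1,\ldots,N} X_{n,i}\right)\right] = \mathbb{E}\left[\phi_L\left(\max_{i=1,\ldots,N} G_i\right)\right].
\]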


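Moreover, for any L > 0,

\[
\mathbb{E}\left[\max_{i=1,\ldots,N} X_{n,i}\right]
\ge \mathbb{E}\left[\phi_L\left(\max_{i=1,\ldots,N} X_{n,i}\right)\right]
- \mathbb{E}\left[\left(\left|\max_{i=1,\ldots,N} X_{n,i}\right| - L\right)\mathbb{I}_{\{|\max_{i=1,\ldots,N} X_{n,i}| > L\}}\right],
\]

where

\[
\begin{aligned}
\mathbb{E}\left[\left(\left|\max_{i=1,\ldots,N} X_{n,i}\right| - L\right)\mathbb{I}_{\{|\max_{i=1,\ldots,N} X_{n,i}| > L\}}\right]
&= \int_{0}^{\infty} \mathbb{P}\left[\left|\max_{i=1,\ldots,N} X_{n,i}\right| > L + u\right] du \\
&= \int_{L}^{\infty} \mathbb{P}\left[\left|\max_{i=1,\ldots,N} X_{n,i}\right| > u\right] du \\
&\le 2N\int_{L}^{\infty} e^{-u^{2}/2}\, du
\qquad \text{(by Hoeffding’s inequality; see Corollary A.1)} \\
&\le 2N\int_{L}^{\infty} \left(1 + \frac{1}{u^{2}}\right) e^{-u^{2}/2}\, du \\
&= \frac{2N}{L}\, e^{-L^{2}/2}.
\end{aligned}
\]

Therefore, we have, for any L > 0,

\[
\liminf_{n \to \infty} \mathbb{E}\left[\max_{i=1,\ldots,N} X_{n,i}\right]
\ge \mathbb{E}\left[\phi_L\left(\max_{i=1,\ldots,N} G_i\right)\right] - \frac{2N}{L}\, e^{-L^{2}/2}.
\]

Letting $L \to \infty$ on the right-hand side, and using the dominated convergence theorem, we
see that

\[
\liminf_{n \to \infty} \mathbb{E}\left[\max_{i=1,\ldots,N} X_{n,i}\right] \ge \mathbb{E}\left[\max_{i=1,\ldots,N} G_i\right].
\]

The proof that

\[
\limsup_{n \to \infty} \mathbb{E}\left[\max_{i=1,\ldots,N} X_{n,i}\right] \le \mathbb{E}\left[\max_{i=1,\ldots,N} G_i\right]
\]

is similar. ∎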
For a proof of the next result see, for example, Galambos [122]. 


Lemma A.12. Let $G_1, \ldots, G_N$ be independent standard normal random variables. Then

\[
\lim_{N \to \infty} \frac{\mathbb{E}\left[\max_{i=1,\ldots,N} G_i\right]}{\sqrt{2\ln N}} = 1.
\]


The following lemma is a related nonasymptotic inequality for maxima of subgaussian 
random variables. 


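Lemma A.13. Let $\sigma > 0$, and let $X_1, \ldots, X_N$ be real-valued random variables such that
for all $\lambda > 0$ and $1 \le i \le N$, $\mathbb{E}\left[e^{\lambda X_i}\right] \le e^{\lambda^{2}\sigma^{2}/2}$. Then

\[
\mathbb{E}\left[\max_{i \le N} X_i\right] \le \sigma\sqrt{2\ln N}.
\]


Proof. By Jensen’s inequality, for all $\lambda > 0$,

\[
e^{\lambda \mathbb{E}[\max_{i \le N} X_i]}
\le \mathbb{E}\left[e^{\lambda \max_{i \le N} X_i}\right]
= \mathbb{E}\left[\max_{i \le N} e^{\lambda X_i}\right]
\le \sum_{i=1}^{N} \mathbb{E}\left[e^{\lambda X_i}\right]
\le N e^{\lambda^{2}\sigma^{2}/2}.
\]

Thus,

\[
\mathbb{E}\left[\max_{i \le N} X_i\right] \le \frac{\ln N}{\lambda} + \frac{\lambda\sigma^{2}}{2},
\]

and taking $\lambda = \sqrt{2\ln N/\sigma^{2}}$ yields the result. ∎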


A.1.7 Proof of Theorem 8.3 
The technique of the proof, called “chaining,” is due to Dudley [91]. 

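For each $k = 0, 1, 2, \ldots$, let $\mathcal{F}^{(k)}$ be a minimal cover of $\mathcal{F}$ of radius $D2^{-k}$. Note that
$|\mathcal{F}^{(k)}| = N_\rho(\mathcal{F}, D2^{-k})$. Denote the unique element of $\mathcal{F}^{(0)}$ by $f_0$.

Let $\Omega$ be the common domain where the random variables $T_f$, $f \in \mathcal{F}$, are defined.
Pick $\omega \in \Omega$ and let $f^{*} \in \mathcal{F}$ be such that $\sup_{f \in \mathcal{F}} T_f(\omega) = T_{f^{*}}(\omega)$. (Here we implicitly
assume that such an element exists. The modification of the proof for the general case is
straightforward.)

For each $k \ge 0$, let $f_k^{*}$ denote an element of $\mathcal{F}^{(k)}$ whose distance to $f^{*}$ is minimal.
Clearly, $\rho(f^{*}, f_k^{*}) \le D2^{-k}$, and therefore, by the triangle inequality, for each $k \ge 1$,

\[
\rho(f_{k-1}^{*}, f_k^{*}) \le \rho(f^{*}, f_k^{*}) + \rho(f^{*}, f_{k-1}^{*}) \le 3D2^{-k}. \tag{A.2}
\]

Clearly, $\lim_{k \to \infty} f_k^{*} = f^{*}$, and so by the sample continuity of the process,

\[
\sup_{f \in \mathcal{F}} T_f(\omega) = T_{f^{*}}(\omega) = T_{f_0}(\omega) + \sum_{k=1}^{\infty}\left(T_{f_k^{*}}(\omega) - T_{f_{k-1}^{*}}(\omega)\right).
\]

Therefore

\[
\mathbb{E}\left[\sup_{f \in \mathcal{F}} T_f\right] \le \sum_{k=1}^{\infty} \mathbb{E}\left[\max_{f,g}\left(T_f - T_g\right)\right],
\]

where the max is taken over all pairs $(f, g) \in \mathcal{F}^{(k)} \times \mathcal{F}^{(k-1)}$ such that $\rho(f, g) \le 3D2^{-k}$.

Noting that there are at most $N_\rho(\mathcal{F}, D2^{-k})^{2}$ of these pairs, and recalling that $\{T_f : f \in \mathcal{F}\}$
is subgaussian in the metric $\rho$, we can apply Lemma A.13 using (A.2). Thus, for each
$k \ge 1$,

\[
\mathbb{E}\left[\max_{f,g}\left(T_f - T_g\right)\right] \le 3D2^{-k}\sqrt{2\ln N_\rho(\mathcal{F}, D2^{-k})^{2}}.
\]

Summing over k, we obtain

\[
\begin{aligned}
\mathbb{E}\left[\sup_{f \in \mathcal{F}} T_f\right]
&\le \sum_{k=1}^{\infty} 3D2^{-k}\sqrt{2\ln N_\rho(\mathcal{F}, D2^{-k})^{2}} \\
&= 12\sum_{k=1}^{\infty} D2^{-(k+1)}\sqrt{\ln N_\rho(\mathcal{F}, D2^{-k})} \\
&\le 12\int_{0}^{D/2}\sqrt{\ln N_\rho(\mathcal{F}, \varepsilon)}\, d\varepsilon,
\end{aligned}
\]

as desired. ∎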




A.1.8 Rademacher Averages 


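Let $A \subset \mathbb{R}^{n}$ be a bounded set of vectors $\mathbf{a} = (a_1, \ldots, a_n)$, and introduce the quantity

\[
R_n(A) = \mathbb{E}\left[\sup_{\mathbf{a} \in A} \frac{1}{n}\sum_{i=1}^{n} \sigma_i a_i\right],
\]

where $\sigma_1, \ldots, \sigma_n$ are independent random variables with $\mathbb{P}[\sigma_i = 1] = \mathbb{P}[\sigma_i = -1] = 1/2$.
$R_n(A)$ is called the Rademacher average associated with A. $R_n(A)$ measures, in a sense,
the richness of the set A.

Next we recall some of the simple structural properties of Rademacher averages. Observe
that if A is symmetric in the sense that $\mathbf{a} \in A$ implies $-\mathbf{a} \in A$, then

\[
R_n(A) = \mathbb{E}\left[\sup_{\mathbf{a} \in A} \frac{1}{n}\left|\sum_{i=1}^{n} \sigma_i a_i\right|\right].
\]

Let A, B be bounded symmetric subsets of $\mathbb{R}^{n}$ and let $c \in \mathbb{R}$ be a constant. Then the
following subadditivity properties are obvious from the definition:

\[
\begin{aligned}
R_n(A \cup B) &\le R_n(A) + R_n(B), \\
R_n(c \cdot A) &= |c|\, R_n(A), \\
R_n(A \oplus B) &\le R_n(A) + R_n(B),
\end{aligned}
\]

where $c \cdot A = \{c\mathbf{a} : \mathbf{a} \in A\}$ and $A \oplus B = \{\mathbf{a} + \mathbf{b} : \mathbf{a} \in A, \mathbf{b} \in B\}$. It follows from
Hoeffding’s inequality (Lemma A.1) and Lemma A.13 that if $A = \{\mathbf{a}^{(1)}, \ldots, \mathbf{a}^{(N)}\} \subset \mathbb{R}^{n}$ is a finite
set, then

\[
R_n(A) \le \max_{j=1,\ldots,N} \left\|\mathbf{a}^{(j)}\right\| \frac{\sqrt{2\log N}}{n}. \tag{A.3}
\]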
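
For a finite set in low dimension the Rademacher average can be computed exactly by enumerating all $2^n$ sign vectors, which gives a direct way to compare it with the bound (A.3). The sketch below is illustrative; the function names are not from the text, and the enumeration is practical only for small n.

```python
from itertools import product
from math import log, sqrt
from typing import Sequence


def rademacher_average(A: Sequence[Sequence[float]]) -> float:
    """Exact R_n(A) = E[sup_{a in A} (1/n) sum_i sigma_i a_i] for a finite set A."""
    n = len(A[0])
    total = 0.0
    for signs in product((-1, 1), repeat=n):
        total += max(sum(s * x for s, x in zip(signs, a)) for a in A)
    return total / (2**n * n)


def finite_class_bound(A: Sequence[Sequence[float]]) -> float:
    """Right-hand side of (A.3): max_j ||a^(j)|| * sqrt(2 log N) / n."""
    n, N = len(A[0]), len(A)
    max_norm = max(sqrt(sum(x * x for x in a)) for a in A)
    return max_norm * sqrt(2.0 * log(N)) / n
```

For the symmetric set consisting of (1, 1, 1), (1, -1, 1) and their negatives, the exact value is 2/3, while (A.3) gives roughly 0.96.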


Finally, we mention two important properties of Rademacher averages. The first is that
if $\mathrm{absconv}(A) = \left\{\sum_{j=1}^{N} c_j \mathbf{a}^{(j)} : N \in \mathbb{N}, \sum_{j=1}^{N} |c_j| \le 1, \mathbf{a}^{(j)} \in A\right\}$ is the absolute convex
hull of A, then

\[
R_n(A) = R_n(\mathrm{absconv}(A)),
\]

as is easily seen from the definition. The second is known as the contraction principle: let
$\phi : \mathbb{R} \to \mathbb{R}$ be a function with $\phi(0) = 0$ and Lipschitz constant $L_\phi$. Defining $\phi \circ A$ as the
set of vectors of the form $(\phi(a_1), \ldots, \phi(a_n)) \in \mathbb{R}^{n}$ with $\mathbf{a} \in A$, we have

\[
R_n(\phi \circ A) \le L_\phi R_n(A)
\]

(see Ledoux and Talagrand [192]). Often it is useful to derive further upper bounds on
Rademacher averages. As an illustration we consider the case when A is a subset of
$\{-1, 1\}^{n}$. Obviously, $|A| \le 2^{n}$. By inequality (A.3), the Rademacher average is bounded
in terms of the logarithm of the cardinality of A. This logarithm may be upper bounded in
terms of a combinatorial quantity, called the VC dimension. If $A \subset \{-1, 1\}^{n}$, then the VC
dimension of A is the size V of the largest set of indices $\{i_1, \ldots, i_V\} \subset \{1, \ldots, n\}$ such that
for each binary V-vector $\mathbf{b} = (b_1, \ldots, b_V) \in \{-1, 1\}^{V}$ there exists an $\mathbf{a} = (a_1, \ldots, a_n) \in A$
such that $(a_{i_1}, \ldots, a_{i_V}) = \mathbf{b}$. The key inequality establishing a relationship between
shatter coefficients and VC dimension is known as Sauer’s lemma (proved independently
by Sauer [261], Shelah [266], and Vapnik and Chervonenkis [293]), which states that the
cardinality of any set $A \subset \{-1, 1\}^{n}$ may be upper bounded as

\[
|A| \le \sum_{i=0}^{V} \binom{n}{i} \le (n + 1)^{V},
\]

where V is the VC dimension of A. In particular, for any $A \subset \{-1, 1\}^{n}$,

\[
R_n(A) \le \sqrt{\frac{2V\log(n + 1)}{n}}.
\]
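
The definition of the VC dimension and Sauer's lemma can be checked by brute force on small sets of sign vectors. The following sketch is only an illustration under the assumption that A is nonempty and n is small; its function names are hypothetical.

```python
from itertools import combinations
from math import comb, log, sqrt
from typing import Iterable, Sequence


def vc_dimension(A: Iterable[Sequence[int]]) -> int:
    """Size of the largest index set shattered by A, a subset of {-1, 1}^n."""
    vectors = {tuple(a) for a in A}
    n = len(next(iter(vectors)))
    best = 0
    for size in range(1, n + 1):
        shattered = any(
            len({tuple(a[i] for i in idx) for a in vectors}) == 2**size
            for idx in combinations(range(n), size)
        )
        # Every subset of a shattered set is shattered, so we can stop early.
        if not shattered:
            break
        best = size
    return best


def sauer_bound(n: int, V: int) -> int:
    """sum_{i=0}^{V} C(n, i), the middle term in Sauer's lemma."""
    return sum(comb(n, i) for i in range(V + 1))


def vc_rademacher_bound(n: int, V: int) -> float:
    """sqrt(2 V log(n + 1) / n), the resulting bound on R_n(A)."""
    return sqrt(2.0 * V * log(n + 1) / n)
```

The "threshold" vectors, equal to -1 on a prefix and +1 afterwards, have VC dimension 1 and exactly n + 1 elements, so they meet Sauer's bound with equality.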


This bound is a version of what has been known as the Vapnik–Chervonenkis inequality.
By a somewhat refined analysis (based on chaining, very much in the spirit of the proof
of Theorem 8.3), the logarithmic factor can be removed, and this results in a bound of the
form

\[
R_n(A) \le C\sqrt{\frac{V}{n}}
\]

for a universal constant C (Dudley [91]; see also Lugosi [206]).


A.1.9 The Beta Distribution 
A random variable X taking values in [0, 1] is said to have the Beta distribution with
parameters a, b > 0 if its density function is given by

\[
f(x) = \frac{x^{a-1}(1-x)^{b-1}}{B(a, b)},
\]

where $B(a, b) = \Gamma(a)\Gamma(b)/\Gamma(a + b)$ is the so-called Beta function. Here
$\Gamma(a) = \int_{0}^{\infty} x^{a-1}e^{-x}\, dx$ denotes Euler’s Gamma function.

Let X have Beta distribution (a, b) and consider a random variable B such that, given
X = x, the conditional distribution of B is binomial with parameters n and x. Then the
marginal distribution of B is calculated, for $k = 0, 1, \ldots, n$, by

\[
\begin{aligned}
\mathbb{P}[B = k]
&= \int_{0}^{1} \mathbb{P}[B = k \mid X = x]\, \frac{x^{a-1}(1-x)^{b-1}}{B(a, b)}\, dx \\
&= \binom{n}{k}\int_{0}^{1} \frac{x^{k+a-1}(1-x)^{n-k+b-1}}{B(a, b)}\, dx \\
&= \binom{n}{k}\frac{B(k + a, n - k + b)}{B(a, b)}.
\end{aligned}
\]

Now it is easy to determine the conditional density of X given B = k:

\[
\begin{aligned}
f(x \mid B = k)
&= \frac{f(x)\, \mathbb{P}[B = k \mid X = x]}{\mathbb{P}[B = k]} \\
&= \frac{x^{a-1}(1-x)^{b-1}\binom{n}{k}x^{k}(1-x)^{n-k}}{\binom{n}{k}B(k + a, n - k + b)} \\
&= \frac{x^{k+a-1}(1-x)^{n-k+b-1}}{B(k + a, n - k + b)}.
\end{aligned}
\]

This is recognized as a Beta distribution with parameters k + a and n - k + b.
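
The marginal and the posterior parameters derived above translate directly into a few lines of code. The sketch below uses `math.gamma` for the Beta function; the function names are illustrative and not from the text.

```python
import math
from typing import Tuple


def beta_fn(a: float, b: float) -> float:
    """B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)


def beta_binomial_pmf(k: int, n: int, a: float, b: float) -> float:
    """Marginal P[B = k] when X ~ Beta(a, b) and B | X = x ~ Binomial(n, x)."""
    return math.comb(n, k) * beta_fn(k + a, n - k + b) / beta_fn(a, b)


def posterior_parameters(k: int, n: int, a: float, b: float) -> Tuple[float, float]:
    """Parameters of the conditional (Beta) distribution of X given B = k."""
    return (k + a, n - k + b)
```

With the uniform prior a = b = 1, the marginal of B is uniform on {0, ..., n}, that is, each value has probability 1/(n + 1).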
A.2 Basic Information Theory


In this section we summarize some basic properties of the entropy of a discrete-valued 
random variable. For an excellent introductory book on information theory we refer to 
Cover and Thomas [74]. 

Let X be a random variable taking values in the countable set $\mathcal{X}$ with distribution
$\mathbb{P}[X = x] = p(x)$, $x \in \mathcal{X}$. The entropy of X is defined by

\[
H(X) = \mathbb{E}\left[-\log p(X)\right] = -\sum_{x \in \mathcal{X}} p(x)\log p(x)
\]

(where log denotes natural logarithm and 0 log 0 = 0). If X, Y is a pair of discrete random
variables taking values in $\mathcal{X} \times \mathcal{Y}$, then the joint entropy H(X, Y) of X and Y is defined as
the entropy of the pair (X, Y). The conditional entropy H(X | Y) is defined as

\[
H(X \mid Y) = H(X, Y) - H(Y).
\]

If we write $p(x, y) = \mathbb{P}[X = x, Y = y]$ and $p(x \mid y) = \mathbb{P}[X = x \mid Y = y]$, then

\[
H(X \mid Y) = -\sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} p(x, y)\log p(x \mid y),
\]

from which we see that $H(X \mid Y) \ge 0$. It is also easy to see that the defining identity of
the conditional entropy remains true conditionally, that is, for any three (discrete) random
variables X, Y, Z:

\[
H(X, Y \mid Z) = H(Y \mid Z) + H(X \mid Y, Z).
\]

(Just add H(Z) to both sides and use the definition of the conditional entropy.) A repeated
application of this yields the chain rule for entropy: for arbitrary discrete random variables
$X_1, \ldots, X_n$,

\[
H(X_1, \ldots, X_n) = H(X_1) + H(X_2 \mid X_1) + H(X_3 \mid X_1, X_2) + \cdots + H(X_n \mid X_1, \ldots, X_{n-1}).
\]

Let P and Q be two probability distributions over a countable set $\mathcal{X}$ with probability mass
functions p and q. Then the Kullback–Leibler divergence or relative entropy of P and Q is

\[
D(P\|Q) = \sum_{x \in \mathcal{X} :\, p(x) > 0} p(x)\log\frac{p(x)}{q(x)}.
\]

Since $\log x \le x - 1$,

\[
D(P\|Q) = -\sum_{x \in \mathcal{X} :\, p(x) > 0} p(x)\log\frac{q(x)}{p(x)}
\ge -\sum_{x \in \mathcal{X} :\, p(x) > 0} p(x)\left(\frac{q(x)}{p(x)} - 1\right) \ge 0.
\]

Hence, the relative entropy is always nonnegative and equals 0 if and only if P = Q. This 
simple fact has some interesting consequences. For example, if X is a finite set with N 
elements, X is a random variable with distribution P, and we take Q to be the uniform 
distribution over X, then D(P||Q) = log N — H(X), and therefore the entropy of X never 
exceeds the logarithm of the cardinality of its range. Another immediate consequence of 
the nonnegativity of the relative entropy is the so-called log-sum inequality, which states
that if $a_1, a_2, \ldots$ and $b_1, b_2, \ldots$ are nonnegative numbers with $A = \sum_i a_i$ and $B = \sum_i b_i$,
then

\[
\sum_i a_i\log\frac{a_i}{b_i} \ge A\log\frac{A}{B}.
\]

Let now P and Q be distributions over an n-fold product space $\mathcal{X}^{n}$, and for $t = 1, \ldots, n$
denote by

\[
P_t(x_1, \ldots, x_{t-1}) = \sum_{x_t, x_{t+1}, \ldots, x_n} P(x_1, \ldots, x_n)
\]

and

\[
Q_t(x_1, \ldots, x_{t-1}) = \sum_{x_t, x_{t+1}, \ldots, x_n} Q(x_1, \ldots, x_n)
\]

the marginal distributions of the first t − 1 variables. Then the chain rule for relative entropy
(a straightforward consequence of the definition) states that 


DPE... OP, lOa) 


t=1 X1,...,X¢-1 
where Pix,,....x,_, denotes the conditional distribution over 1”~‘ defined by 


P(x, oa GR) 


Px 4 XtyXt4+15-++5 Xn) = — 
lester t>At+l> ’ n) P Oi, aaa Seay 


and Q),,,...x,_, 18 defined similarly. 
The following fundamental result is known as Pinsker’s inequality. For any pair of
probability distributions P and Q,

\[
\sqrt{\frac{1}{2}D(P\|Q)} \ge \sum_{x :\, P(x) \ge Q(x)} \bigl(P(x) - Q(x)\bigr).
\]

Sketch of Proof. First prove the inequality if P and Q are concentrated on the same two
atoms. Then define $A = \{x : P(x) \ge Q(x)\}$ and the measures $P^{*}$, $Q^{*}$ on the set {0, 1}
by $P^{*}(0) = 1 - P^{*}(1) = P(A)$ and $Q^{*}(0) = 1 - Q^{*}(1) = Q(A)$, and apply the previous
result. ∎
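
The quantities of this section are easy to compute for finite distributions, which allows a quick numerical check of the identity $D(P\|Q) = \log N - H(X)$ for uniform Q and of Pinsker's inequality. The sketch below is illustrative; the function names are hypothetical, and `kl_divergence` assumes q(x) > 0 wherever p(x) > 0.

```python
import math
from typing import Sequence


def entropy(p: Sequence[float]) -> float:
    """H(X) = -sum_x p(x) log p(x), natural logarithm, with 0 log 0 = 0."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """D(P||Q) = sum over {x : p(x) > 0} of p(x) log(p(x) / q(x))."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)


def pinsker_rhs(p: Sequence[float], q: Sequence[float]) -> float:
    """sum over {x : P(x) >= Q(x)} of (P(x) - Q(x)), the total variation distance."""
    return sum(x - y for x, y in zip(p, q) if x >= y)
```

For example, with P = (0.5, 0.3, 0.2) and Q = (0.2, 0.5, 0.3) the right-hand side of Pinsker's inequality is 0.3, while the left-hand side is about 0.33.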


A.3 Basics of Classification 


In this section we summarize some basic facts of the probabilistic theory of binary clas- 
sification. For more details we refer to Devroye, Györfi, and Lugosi [88]. The problem of 
binary classification is to guess the unknown binary class of an observation. An observation 
x is an element of a measurable space X. The unknown nature of the observation is called 
a class, denoted by y, and takes values in the set {0, 1}. 

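In the probabilistic model of classification, the observation/label pair is modeled as a
pair (X, Y) of random variables taking values in $\mathcal{X} \times \{0, 1\}$.

The posterior probabilities are defined, for all $x \in \mathcal{X}$, by

\[
\eta(x) = \mathbb{P}[Y = 1 \mid X = x] = \mathbb{E}[Y \mid X = x].
\]

Thus, $\eta(x)$ is the conditional probability that Y is 1, given X = x.
Any function $g : \mathcal{X} \to \{0, 1\}$ defines a classifier, and the value g(x) represents one’s
guess of y given x. An error occurs if $g(x) \ne y$, and the probability of error for a classifier
g is

\[
L(g) = \mathbb{P}[g(X) \ne Y].
\]

The next lemma shows that the Bayes classifier given by

\[
g^{*}(x) =
\begin{cases}
1 & \text{if } \eta(x) > 1/2, \\
0 & \text{otherwise}
\end{cases}
\]

minimizes the probability of error. Its probability of error $L(g^{*})$ is called the Bayes error.


Lemma A.14. For any classifier $g : \mathcal{X} \to \{0, 1\}$,

\[
\mathbb{P}[g^{*}(X) \ne Y] \le \mathbb{P}[g(X) \ne Y].
\]


Proof. Given X = x, the conditional probability of error of any decision g may be
expressed as

\[
\begin{aligned}
\mathbb{P}[g(X) \ne Y \mid X = x]
&= 1 - \mathbb{P}[Y = g(X) \mid X = x] \\
&= 1 - \bigl(\mathbb{P}[Y = 1, g(X) = 1 \mid X = x] + \mathbb{P}[Y = 0, g(X) = 0 \mid X = x]\bigr) \\
&= 1 - \bigl(\mathbb{I}_{\{g(x)=1\}}\mathbb{P}[Y = 1 \mid X = x] + \mathbb{I}_{\{g(x)=0\}}\mathbb{P}[Y = 0 \mid X = x]\bigr) \\
&= 1 - \bigl(\mathbb{I}_{\{g(x)=1\}}\eta(x) + \mathbb{I}_{\{g(x)=0\}}(1 - \eta(x))\bigr).
\end{aligned}
\]

Thus, for every $x \in \mathcal{X}$,

\[
\begin{aligned}
\mathbb{P}[g(X) \ne Y \mid X = x] - \mathbb{P}[g^{*}(X) \ne Y \mid X = x]
&= \eta(x)\bigl(\mathbb{I}_{\{g^{*}(x)=1\}} - \mathbb{I}_{\{g(x)=1\}}\bigr) + (1 - \eta(x))\bigl(\mathbb{I}_{\{g^{*}(x)=0\}} - \mathbb{I}_{\{g(x)=0\}}\bigr) \\
&= (2\eta(x) - 1)\bigl(\mathbb{I}_{\{g^{*}(x)=1\}} - \mathbb{I}_{\{g(x)=1\}}\bigr) \\
&\ge 0
\end{aligned}
\]

by the definition of $g^{*}$. The statement now follows by taking expected values of both
sides. ∎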


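Lemma A.15. The Bayes error may be written as

\[
L(g^{*}) = \mathbb{P}[g^{*}(X) \ne Y] = \mathbb{E}\bigl[\min\{\eta(X), 1 - \eta(X)\}\bigr].
\]

Moreover, for any classifier g,

\[
L(g) - L(g^{*}) = 2\,\mathbb{E}\left[\left|\eta(X) - 1/2\right| \mathbb{I}_{\{g(X) \ne g^{*}(X)\}}\right].
\]


Proof. The proof of the previous lemma reveals that

\[
L(g) = 1 - \mathbb{E}\left[\mathbb{I}_{\{g(X)=1\}}\eta(X) + \mathbb{I}_{\{g(X)=0\}}(1 - \eta(X))\right]
\]

and, in particular,

\[
L(g^{*}) = 1 - \mathbb{E}\left[\mathbb{I}_{\{\eta(X)>1/2\}}\eta(X) + \mathbb{I}_{\{\eta(X)\le 1/2\}}(1 - \eta(X))\right].
\]

The statements are immediate consequences of these expressions. ∎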
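
On a finite observation space both lemmas can be verified by enumerating every classifier. The sketch below assumes X takes m values with probabilities `mu[j]` and posterior probabilities `eta[j]`; these names, and the function names, are illustrative and not from the text.

```python
from itertools import product
from typing import Iterator, Sequence, Tuple


def error_probability(g: Sequence[int], mu: Sequence[float], eta: Sequence[float]) -> float:
    """L(g) = P[g(X) != Y]: predicting 0 errs with probability eta, predicting 1 with 1 - eta."""
    return sum(m * (e if gj == 0 else 1.0 - e) for gj, m, e in zip(g, mu, eta))


def bayes_classifier(eta: Sequence[float]) -> Tuple[int, ...]:
    """g*(x) = 1 if eta(x) > 1/2 and 0 otherwise."""
    return tuple(1 if e > 0.5 else 0 for e in eta)


def excess_risk_formula(g: Sequence[int], mu: Sequence[float], eta: Sequence[float]) -> float:
    """2 E[|eta(X) - 1/2| I{g(X) != g*(X)}], the right-hand side in Lemma A.15."""
    g_star = bayes_classifier(eta)
    return 2.0 * sum(
        m * abs(e - 0.5) for gj, gs, m, e in zip(g, g_star, mu, eta) if gj != gs
    )


def all_classifiers(m: int) -> Iterator[Tuple[int, ...]]:
    """Every function from an m-point observation space to {0, 1}."""
    return product((0, 1), repeat=m)
```

Points with $\eta(x) = 1/2$ contribute nothing to the excess risk, so the Bayes classifier is not unique when such points have positive probability.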